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a recursive  rather  than  iterative  representation  of  a computation  avoids  the 
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tation within  the  architecture.  For  this  reason,  a program  executed  by  the 
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Chapter  1 
INTRODUCTION 

I. 1 Architecture  of  Parallel  Systems 

Highly  parallel  computer  systems  have  evolved  in  a manner  which  often 
necessitates  the  placing  of  unusual  constraints  on  programs  and  data.  Par- 
allel machines  such  as  the  Illiac  IV  [7]  and  the  CDC  Star  [18]  can  realize 
their  full  potential  only  for  data  represented  in  array  or  vector  format. 

In  these  architectures,  the  programmer  is  forced  to  use  such  data  represen- 
tations, even  when  inappropriate,  to  achieve  highly  parallel  execution. 

A number  of  methods  have  been  developed  to  exploit  simultaneous  or 
concurrent  operation,  however,  the  implementation  of  these  techniques  within 
a traditional  von  Neumann  architecture  has  not  fully  utilized  their  poten- 
tial. This  applies  both  to  the  various  procedures  for  increasing  the  per- 
formance of  a single  processor  and  those  for  exploiting  multiple  processors 
in  a computer  system. 

Three  techniques  are  currently  popular  for  increasing  the  parallel  ac- 
tivity within  a single  processor.  These  are: 

1.  pipelining  of  operations 

2.  overlapped  memory  access 

3.  instruction  lookahead 

The  pipelining  of  an  arithmetic  operation  distributes  the  performance 
of  the  operation  over  time  rather  than  space.  That  is,  rather  than  utiliz- 
ing several  functional  units  of  a specific  type  to  increase  the  processing 
rate,  one  larger  functional  unit  is  employed,  and  the  operation  is  broken 
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into  a number  of  smaller  operations  which  are  performed  simultaneously  upon 
a stream  of  values.  Although  the  performance  of  a single  operation  can  ac- 
tually take  longer  in  a pipelined  functional  unit,  the  fact  that  a large 
number  of  operations  are  being  performed  concurrently  can  produce  a very 
high  processing  rate. 

In  order  to  fully  utilize  the  technique  of  pipelining,  the  data  must 
be  represented  as  a vector;  if  there  are  gaps  in  the  stream  of  values  sup- 
plied to  the  pipeline,  the  processing  rate  can  actually  be  decreased  from 
that  of  a single  conventional  functional  unit.  Current  stream-oriented 
processors  as  the  CDC  Star  [18]  and  the  TI  ASC  [34]  do  not  have  the  capa- 
bility to  form  data  into  streams,  that  burden  must  be  born  by  the  programmer. 

The  technique  of  overlapped  memory  access  merely  extends  the  concept 
of  pipelining  to  the  fetching  of  instructions  from  memory.  If  the  memory 
of  a computer  system  is  interleaved;  that  is,  if  the  memory  is  divided  into 
a number  of  sections,  and  the  instructions  and  data  of  a program  are  dis- 
tributed over  the  sections,  then  several  items  can  be  accessed  simultaneous- 
ly. If  the  instruction?  of  a program  are  arranged  so  consecutive  instruc- 
tions are  contained  in  separate  memories,  then  instruction  fetching  can  be 
pipelined,  and  instructions  can  be  supplied  at  a very  fast  rate.  However, 
a problem  arises  when  a conditional  is  encountered  because  the  system  does 
not  know  which  of  the  set  of  possible  succeeding  instructions  to  fetch  until 
after  the  conditional  has  been  executed. 

The  use  of  instruction  lookahead  in  a processor  allows  the  exploitation 
of  multiple  arithmetic  units  by  decomposing  the  instruction  stream  into  in- 
dependent elements.  For  example,  consider  the  arithmetic  expression  A + B + 


(C  * D) . The  two  computations  A + B and  C * D can  be  performed  simulta- 
neously in  separate  functional  units.  The  IBM  360  model  91  {5,  33]  and 
the  CDC  6600  [32]  have  developed  techniques  for  exploiting  this  property 


for  short  instruction  sequences;  however,  once  again,  any  branching  in  the 


program  will  disrupt  the  flow  of  instructions  to  the  functional  units  and 
decrease  the  processing  capability  of  the  architecture. 

In  illustration  of  the  problems  encountered  in  exploiting  these  tech- 
niques, consider  the  IBM  360/91  [16].  The  functional  capability  of  this 
processor  is  70  million  instructions  per  second  (MIPS).  However,  the  in- 
struction decoder  can  only  supply  instructions  at  a rate  of  16  MIPS  using 
the  technique  of  lookahead.  An  average  incidence  of  conditional  instruc- 
tions reduces  the  performance  of  the  processor  to  6 MIPS.  Thus,  the  pro- 
cessing capability  of  the  architecture  cannot  be  fully  realized,  and  with 
the  lookahead  of  eight  instructions  which  is  used,  it  is  difficult  to  have 
an  adequate  instruction  mix  to  fully  utilize  the  multiple  functional  units. 

Another  common  technique  for  increasing  the  performance  of  a processor 
is  that  of  separating  the  memory  system  of  the  processor  into  levels  of  mem- 
ory. This  allows  the  expensive  fast  memory,  known  as  a cache,  to  contain 
only  the  most  active  instructions  and  data.  In  order  to  exploit  this  tech- 
nique, instructions  and  data  are  organized  into  blocks  known  as  pages,  and 
these  pages  are  transferred  between  levels  of  storage.  The  utilization  of 
pages  is  wasteful  in  that  often  not  all  space  in  the  page  is  filled  (known 
as  breakage).  Also,  in  order  to  reference  one  item  on  a page,  the  entire 
page  must  be  transferred  to  the  cache,  moving  a great  deal  of  unnecessary 
information.  In  a multiple  processor  environment  another  problem  arises. 
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that  of  several  copies  of  a page  existing  in  the  caches  of  different  pro- 
cessors. If  one  copy  is  altered,  the  o er  processors  will  not  have  knowl- 
edge of  that  fact. 

In  addition  to  increasing  the  performance  of  a single  processor,  there 
has  been  a trend  toward  connecting  several  processors  together  to  form  a 
multiprocessor  system.  Such  a system  can  be  either  homogeneous;  that  is, 
each  processor  can  be  replaced  by  any  other,  or  non-homogeneous , each  pro- 
cessor has  its  own  special  function.  However,  in  either  case  there  is  a 
degradation  in  performance  of  a multiprocessor  system  from  the  possible 
performance  of  the  individual  processors  working  on  separate  problems  due 
to  memory  conflict  and  the  necessary  operating  system.  It  is  claimed  that, 
in  general,  for  a two  processor  system,  2.2  times  as  much  hardware  is  re- 
quired as  for  the  two  separate  processors,  and  the  resulting  performance  is 
1.8  times  the  performance  of  a single  processor  [2]. 

Homogeneous  multiprocessor  systems  such  as  array  and  pipeline  processors 
have  the  drawback  that  they  can  only  achieve  their  full  capability  for  special 
data  representations,  as  has  been  previously  discussed.  Let  us  consider  more 
conventional  methods  of  connecting  processors  in  a multiprocessor  system. 

There  are  three  techniques  which  are  commonly  utilized: 

1.  time-shared  or  common  bus 

2.  crossbar  switch 

3.  multiport  memory 


A time-shared  bus  provides  a single  path  interconnecting  all  units  of 
a system  (Figure  1.1).  Any  unit  which  wishes  to  transfer  data  to  another 
unit  must  obtain  control  of  the  bus,  transfer  the  desired  information,  and 


Figure  1.2.  Structure  of  a crossbar  switch. 
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then  release  the  bus.  This  technique  is  the  cheapest,  least  complex,  and 
easiest  to  modify.  However,  the  capability  of  the  entire  system  is  limited 
by  the  bus  transfer  rate,  and  the  expansion  of  the  system  may  actually  de- 
grade the  overall  system  performance.  Also,  this  system  is  the  least  fault 
tolerant,  failure  of  the  bus  has  catastrophic  results. 

Extending  this  concept  by  adding  more  buses  to  the  system  to  the  point 
where  each  memory  has  a bus  which  can  be  connected  to  any  processor,  we  re- 
alize a structure  known  as  a crossbar  switch  (Figure  1.2).  Although  the 
possibility  of  conflict  still  exists  in  this  configuration,  it  is  possible 
for  a number  of  transfers  to  occur  simultaneously.  This  structure  has  the 
potential  for  the  highest  system  transfer  rate,  and  expansion  of  the  system 
should  improve  its  overall  efficiency.  However,  the  crossbar  switch  is  a 
complex  and  expensive  device  and  can  utilize  as  much  as  25  per  cent  of  the 
system  hardware  [8].  Also,  the  complexity  of  the  switch  grows  as  the  pro- 
duct of  the  number  of  processors  and  memories. 

An  interconnection  technique  which  is  often  used  in  large  time  sharing 
systems  is  that  of  the  multiport  memory  structure.  The  memory  modules  of 
the  system  have  more  than  one  access  port  and  contain  control  circuitry  to 
resolve  conflicts  between  requests  received  on  separate  ports  (Figure  1.3). 

No  special  switching  circuitry  is  required  in  this  configuration.  However, 
the  memory  units  are  very  expensive,  and  the  size  of  the  system  and  config- 
uration options  are  limited  by  the  number  and  type  of  memory  ports  available. 

The  methods  of  structuring  multiple  processor  systems  and  improving  the 
performance  of  a processor  all  have  serious  drawbacks  to  the  full  exploita- 
tion of  the  capabilities  of  the  processors.  In  this  thesis  an  approach  to 
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the  structuring  of  a computer  system  which  offers  attractive  solutions  to 
many  of  these  problems  is  presented. 

1.2  The  Data-Flow  Approach 

Studies  of  concurrent  operation  within  a computer  system  and  of  the 
representation  of  parallelism  in  a programming  language  have  yielded  a new 
form  of  program  representation,  known  as  data  flow.  Execution  of  a data- 
flow program  is  data-driven;  that  is,  each  instruction  is  enabled  for  exe- 
cution just  when  each  required  operand  has  been  supplied  by  the  execution 
of  a predecessor  instruction.  Data-flow  representations  for  programs  have 
been  described  by  Karp  and  Miller  [19],  Rodriguez  [28],  Adams  [1],  Dennis 
and  Fosseen  [12],  Bahrs  [6],  Kosinski  [20],  and  Dennis  [9]. 

In  order  to  take  advantage  of  the  parallelism  inherent  in  an  elemen- 
tary data-flow  representation,  the  architecture  of  the  elementary  data-flow 
processor  was  developed  by  Dennis  and  Misunas  [13].  The  class  of  programs 
implemented  by  this  processor  incorporates  no  fancy  capabilities  such  as 
recursion,  data  structures,  conditionals,  or  iteration.  However,  the  lan- 
guage and  its  corresponding  architecture  are  well-suited  for  the  represen- 
tation of  signal  processing  computations  such  as  filtering,  waveform  gener- 
ation, and  fast  Fourier  transforms,  in  which  a group  of  operations  is  to  be 
performed  once  for  each  sample  (in  time)  of  the  signals  being  processed. 

In  the  development  of  the  basic  data-flow  processor  we  added  condition- 
al and  iterative  constructs  to  the  language  and  architecture  (14].  Also, 
the  basic  data-flow  architecture  incorporates  a multi-level  memory  fystem 
in  which  the  active  memory  is  operated  as  a cache,  and  individual  instruc- 
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tions  are  retrieved  from  the  auxiliary  memory  as  they  are  required  for  com- 
putation. 

An  extension  of  the  architectural  concepts  to  a general-purpose  com- 
puter, incorporating  procedures,  recursive  program  activation,  and  data 
structures  has  been  completed  recently  [11,  24,  26,  27]  and  is  the  subject 
of  this  thesis. 

The  problems  of  processor  switching  and  memory/processor  interconnec- 
tion are  avoided  within  the  data-flow  architecture  by  the  use  of  intercon- 
nection networks  which  have  a great  deal  of  inherent  parallelism.  Sections 
of  the  machine  communicate  by  means  of  fixed  size  information  packets,  and 
delays  in  packet  transmission  within  the  networks  do  not  affect  the  utili- 
zation of  the  hardware.  The  interconnection  networks  are  large,  but  grow 
at  a much  slower  rate  than  a crossbar  switch  and  require  none  of  the  global 
control  circuitry  necessary  for  the  switch. 

The  structure  of  a data-flow  processor  allows  a large  number  of  instruc- 
tions to  be  active  simultaneously.  These  active  instructions  pass  through 
the  networks  concurrently  and  form  streams  of  instructions  for  the  pipelined 
functional  units. 

The  processor  does  not  utilize  an  instruction  register  or  instruction 
decoder;  an  instruction  proceeds  on  its  own  when  its  operands  are  ready  and 
delivers  its  results  to  other  instructions  which  are  waiting  for  them.  No 
software  operating  system  is  necessary  within  the  architecture.  Processor 
allocation,  the  formation  of  instructions  into  streams  for  the  functional 
units,  and  the  transfer  of  information  between  levels  of  memory  is  efficiently 
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accomplished  by  the  hardware  of  the  machine. 

The  configuration  of  component  parts  in  a data-flow  processor  is  read- 
ily chosen  to  fit  a desired  application,  allowing  the  architecture  to  be 
applicable  to  a wide  variety  of  problems.  The  processor  is  also  expandable, 
and  the  performance  of  the  processor  increases  linearly  with  an  increase 
in  size. 

The  exploitation  of  data  dependencies  in  programs  has  been  investigated 
previously  by  Shaprio,  Saint,  and  Presberg  [30],  Miller  and  Cocke  [22],  Seeber 
and  Lindquist  [31],  and  Rumbaugh  [29].  Indeed,  such  is  the  goal  of  the  look- 
ahead techniques  utilized  in  architectures  such  as  the  IBM  360/91  and  the 
CDC  6600.  The  approach  taken  in  the  data-flow  processor  differs  from  these 
approaches  in  that  it  utilizes  a radically  different  concept  of  computer 
organization  which  offers  attractive  solutions  to  many  of  the  problems  en- 
countered in  adapting  von  Neumann  machines  for  parallel  computation,  an 
architecture  in  which  parallelism  and  concurrency  are  inherent  in  the  struc- 
ture of  the  processor. 

1.3  Outline  of  the  Thesis 

The  organization  of  the  remaining  chapters  of  the  thesis  is  as  follows: 

Chapter  2 presents  the  data-flow  language  chosen  as  the  base  language 
of  the  processor.  A program  in  the  language  is  represented  as  an  acyclic 
directed  graph,  eliminating  the  problems  associated  with  the  execution  of 
cyclic  program  structures  within  a highly  parallel  architecture.  The  data- 
flow language  has  a capability  greater  than  that  of  Algol  60  and  is  a gen- 

T 

eralization  of  pure  Lisp. 
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Chapter  3 describes  the  architecture  of  the  data-flow  processor.  The 
machine  is  presented  in  stages,  examining  first  the  execution  of  instruc- 
tions within  the  processor  and  then  the  implementation  of  data  structures. 
The  structure  of  a multi-level  memory  system  is  considered  next,  and  to 
complete  the  presentation,  the  implementation  of  procedures  is  described. 

Chapter  4 illustrates  the  reason  for  restricting  a program  executed 
within  the  data-flow  processor  to  an  acyclic  directed  graph  representation. 
The  performance  tradeoffs  within  the  processor  between  cyclic  and  acyclic 
program  structures  are  examined  to  demonstrate  the  necessity  of  such  a 
restriction. 

The  final  chapter  summarizes  the  architectural  concepts  and  discusses 
topics  for  future  research. 
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Chapter  2 

THE  DATA-FLOW  LANGUAGE 

The  data-flow  language  presented  in  this  chapter  serves  as  the  base 
language  for  the  architecture  to  be  described  in  the  following  chapter.  The 
semantics  of  the  language  is  identical  to  that  of  the  data-flow  procedure 
language  developed  by  Dennis  [9].  The  only  major  difference  arises  in  the 
manner  in  which  simultaneous  procedure  activations  are  differentiated.  The 
model  proposed  by  Dennis  distinguishes  between  the  data  utilized  by  separate 
activations  of  a procedure  through  the  use  of  a unique  color  associated  with 
all  data  of  each  activation.  The  model  described  in  this  chapter  is  imple- 
memtation  oriented  and  therefore  uses  a technique  similar  to  the  "copy  rule" 
of  Algol. 

2 . 1 Data-Flow  Programs 

A program  in  the  data-flow  language  is  composed  of  two  kinds  of  elements, 
called  actors  and  links.  An  actor  of  the  language  can  be  either  an  operator, 
a decider,  or  a gate  (Figure  2.1).  Each  actor  has  a number  of  input  arcs 
which  supply  values  necessary  for  its  execution  and  one  output  arc  upon  which 
results  are  placed.  A small  dot  or  circle  represents  a link  which  has  one 
input  arc  upon  which  it  receives  results  from  an  actor  and  a number  of  out- 
put arcs  over  which  it  distributes  copies  of  the  result  to  other  actors 
(Figure  2.2). 


Values  are  conveyed  over  the  arcs  of  the  program  by  tokens  which  are 
represented  by  large  solid  dots.  An  actor  with  a token  on  each  of  its  input 
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(e)  merge  (f)  Boolean  operator 


Figure  2.1.  Actors  of  the  data-flow  language. 


A A 

(a)  data  link  (b)  control  link 


Figure  2.2.  Links  of  the  data-flow  language. 


arcs,  and  no  token  on  its  output  arc,  is  enabled  and  sometime  later  will 
fire,  removing  the  tokens  from  its  input  arcs,  computing  a result  using  the 


values  carried  by  the  input  tokens,  and  associating  the  result  with  a token 
placed  on  its  output  arc.  In  a similar  manner,  a link  is  enabled  when  a 
token  is  present  on  its  input  arc,  and  no  token  is  present  on  any  of  its 
output  arcs.  It  fires  by  removing  the  token  from  its  input  arc  and  asso- 
ciating copies  of  the  value  carried  by  the  input  token  with  tokens  placed 
on  its  output  arcs. 

The  data-flow  language  utilizes  two  types  of  tokens:  data  tokens  and 
control  tokens.  A data  token  carries  a data  value  which  is  produced  by  an 
operator  (Figure  2.1a)  as  a result  of  some  arithmetic  operation.  A control 
token  is  generated  at  a decider  (Figure  2.1b)  which,  when  the  decider  re- 
ceives a data  value  on  each  input  arc,  applies  its  associated  predicate  and 
produces  either  a true-  or  false-valued  control  token  on  its  output  arc. 

Control  tokens  direct  the  flow  of  data  tokens  by  means  of  either  a T- 
gate,  F-gate  or  merge  actor  (Figure  2.1c,  d,  e).  A T-gate  will  pass  the 
data  token  on  its  input  arc  to  its  output  arc  when  it  receives  a control 
token  carrying  the  value  true  over  its  control  input  arc.  It  will  absorb 
the  data  token  on  its  input  arc  and  place  nothing  on  its  output  arc  if  a 
false-valued  control  token  is  received.  Similarly,  an  F-gate  will  pass  its 
input  data  token  to  its  output  arc  only  on  receipt  of  a false-valued  token 
on  the  control  input.  Upon  receipt  of  a true -valued  token,  it  will  absorb 
the  data  token. 


A merge  actor  (Figure  2.1e)  has  a true  input,  a false  input,  and  a 
control  input.  It  passes  to  its  output  arc  a data  token  from  the  input  arc 
corresponding  to  the  value  of  the  control  token  received.  Any  tokens  on  the 


-19- 


other  input  are  not  affected. 

In  illustration  of  the  capability  of  the  data-flow  language,  consider 
the  iterative  data-flow  program  in  Figure  2.3  for  the  following  computation: 

input  x,  y 

n :=  0 

while  x>y  do 
x :=  x-y 
n : “ n+1 

end 

output  x,  n 

Upon  exiting  the  body  of  the  iteration,  n is  equal  to  the  original  value 
of  x divided  by  y,  and  x is  equal  to  the  remainder. 

In  the  data-flow  program  the  control  input  arcs  of  the  three  merge  ac- 
tors carry  false -valued  control  tokens  in  the  initial  configuration  to  allow 
the  input  values  of  x and  y and  the  constant  0 to  be  admitted  as  initial  val- 
ues for  the  iteration.  Once  these  values  have  been  received,  the  predicate 
joy  is  tested.  If  it  is  true,  the  value  of  y and  the  new  value  of  x are 
cycled  back  into  the  body  of  the  iteration  through  the  T-gates  and  two  merge 
nodes.  Concurrently,  the  remaining  T-gate  and  merge  node  return  an  incre- 
mented value  of  the  iteration  count  n.  When  the  output  of  the  decider  is 
false , the  current  values  of  x and  n are  delivered  through  the  two  F-gates, 
and  the  initial  configuration  is  restored. 


2.2  Structure  Values 

The  values  conveyed  by  tokens  over  the  arcs  of  a data-flow  program  are 
either  elementary  values  or  structure  va lues , and  each  value  has  an  asso- 
ciated tag  designating  its  type.  The  set  of  elementary  values  E contains 

E = TU  IURUQ 


L. 
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where : 

T = truth  values 
I = integers 
R = reals 
Q = strings 

A structure  value  in  a data-flow  program  is  represented  as  an  acyclic 
directed  graph  having  one  root  node  with  the  property  that  each  node  of  the 
graph  can  be  reached  by  a directed  path  from  the  root  node.  Each  node  of 
the  graph  is  either  a structure  node  or  an  elementary  node.  A structure 
node  serves  as  the  root  node  for  a substructure  of  the  structure  and  con- 
sists of  a set  of  selector-value  pairs 

S = {(Sl>  vL)  ...  (sn>  vn)} 

where 

s e I U Q 

vt  £ E U S U fnill 

and  s^  is  the  selector  of  node  v^.  An  elementary  node  has  no  emanating  arcs; 
rather,  an  elementary  value  is  associated  with  the  node.  A node  with  no  em- 
anating arcs  and  no  associated  elementary  value  has  value  f nill . 

A structure  value  is  represented  by  a data  token  carrying  a unique  point- 
er to  the  root  node  of  the  structure.  In  Figure  2.4  the  structure  a contains 
three  elementary  values  a,  b,  and  c,  designated  by  the  simple  selector  L and 
the  compound  selectors  R*L  and  R-R  respectively.  Structure  node  y of  struc- 
ture a is  shared  with  structure  P and  is  designated  by  a different  selector 
in  p than  in  a. 

A simple  selector  associated  with  a node  can  be  either  on  integer  or  a 
string.  A compound  selector  is  formed  by  the  concatenation  of  a number  of 
simple  selectors  and  specifies  a path  through  the  structure  which  can  be 
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followed  by  applying  the  simple  selectors  in  the  stated  order. 

The  structure  values  of  a data-flow  program  are  located  within  a heap 
[9].  The  state  of  the  program  at  any  point  in  time  is  represented  by  a to- 
ken distribution  on  the  graph  of  the  program  and  a heap.  During  each  exe- 
cution step  of  the  program,  some  link  or  actor  will  fire,  resulting  in  a 
new  token  distribution  and,  in  some  cases,  a modified  heap. 

A node  of  the  heap  is  accessible  to  a program  only  if  some  token  car- 
ries a pointer  to  the  node  or  the  node  can  be  reached  by  a directed  path 
from  some  accessible  node.  Upon  completion  of  an  execution  step  of  a pro- 
gram, any  nodes  of  the  heap  made  inaccessible  by  that  step  are  deleted  to- 
gether with  any  emanating  branches. 

In  order  to  generate  and  perform  operations  upon  structure  values,  a num- 
ber of  new  actors  must  be  defined.  The  structure  actors  presented  herein 
are  not  necessarily  the  only  ones  one  might  desire,  however,  they  provide 
the  necessary  basic  operations  for  the  creation  and  manipulation  of  a struc- 
ture and  serve  to  illustrate  the  manner  in  which  structures  are  handled  with- 
in the  data-flow  architecture.  The  operation  of  each  of  the  structure  actors 
of  the  data-flow  language  is  presented  in  Figures  2.5  to  2.8. 

Structures  are  created  through  use  of  the  construct  actor  (Figure  2.5). 
The  actor  accepts  an  elementary  or  structure  value  from  each  input  and  places 
on  its  output  a structure  containing  the  input  values  as  components.  Each 
input  is  labeled  with  the  selector  in  the  new  structure  to  be  associated  with 
the  value  arriving  on  that  input. 


A value  is  retrieved  from  a structure  by  the  select  actor  (Figure  2.6). 
The  value  in  the  input  structure  designated  by  the  selector  argument  is 
placed  on  the  output  of  the  actor.  The  result  can  be  either  an  elementary 


Figure  2.5.  Operation  of  the  construct  actor 


Figure  2.6.  Operation  of  the  select  actor 
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value  or  a structure  value.  If  the  argument  of  the  actor  is  a multiple  se- 
lector, the  actor  produces  on  its  output  the  value  at  the  end  of  the  path 
designated  by  the  multiple  selector.  The  action  of  the  actor  is  undefined 
if  the  input  structure  does  not  contain  the  specified  selector(s). 

Structure  values  in  a data-flow  program  are  not  modified;  rather,  new 
structure  values  are  created  which  are  modifications  of  the  original  values, 
while  the  original  values  are  preserved.  The  append  and  delete  actors  provide 
the  means  of  creating  these  new  structure  values. 

The  structure  produced  by  the  firing  of  an  append  actor  is  a version  of 
the  input  structure  which  contains  a new  or  modified  component  (Figure  2.7). 

If  the  specified  node  of  the  input  structure  has  a selector  corresponding  to 
the  selector  argument  of  the  actor,  the  value  designated  by  that  selector  in 
the  new  structure  is  the  input  value.  Otherwise  the  specified  selector- 
value  pair  is  added  to  the  node  of  the  new  structure.  Identical  elements 
of  the  input  and  output  structures  are  shared  between  the  two  structures. 

In  a similar  manner,  the  structure  appearing  on  the  output  arc  of  a 
delete  actor  is  a version  of  the  input  structure  in  which  the  specified  node 
contains  one  fewer  component  (Figure  2.8).  The  specified  node  in  the  new 
structure  is  missing  the  selector-value  pair  designated  by  the  selector  arg- 
ument. As  with  the  append  actor,  identical  elements  are  shared  between  the 
input  and  output  structures. 


I 


2.3  Data-Flow  Procedure  Representation 

In  this  section  we  present  an  approach  to  the  description  of  procedures 


i 


within  the  data-flow  language.  Procedures  of  the  language  are  represented 
as  acyclic  directed  graphs  in  a manner  which  is  very  attractive  from  both 
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a semantic  viewpoint  and  an  implementation  viewpoint.  Similar  descriptions 
of  a base  language  in  terms  of  tree  structures  have  been  presented  by  Dennis 
[10],  Amerasinghe  [3],  Ellis  [15],  and  Henderson  [17].  The  application  of 
such  an  approach  to  the  description  of  a data-flow  computation  has  not  been 
previously  considered , yet  the  extension  to  a data-flow  representation  ap- 
pears to  be  very  attractive. 

A data-flow  procedure  is  a data-flow  program  with  a single  input  arc 
over  which  the  argument  arrives  and  a single  output  arc  upon  which  the  re- 
sult is  placed.  The  body  of  a procedure  is  represented  as  a data  structure, 
and  the  procedure  is  referenced  by  a token  carrying  a pointer  to  the  struc- 
tured representation.  Every  procedure  in  the  language  is  determinate;  that 
is,  the  same  result  is  produced  by  every  activation  of  the  procedure  which 
receives  the  same  input  values. 

To  provide  for  procedure  activation  and  termination,  the  apply  and  re- 
turn actors  are  introduced  into  the  data-flow  language.  The  operation  of 
these  actors  is  shown  in  Figure  2.9.  The  apply  actor  receives  two  inputs, 
a procedure  and  an  argument,  which  may  be  either  an  elementary  value  or  a 
structure  value.  Upon  firing,  the  actor  creates  an  argument  structure  of  the 
argument  and  the  destination  for  the  result  of  the  application,  and  this  ar- 
gument structure  is  given  to  the  procedure  as  input.  If  no  instruction  fol- 
lows the  apply  actor  in  the  program,  the  value  designated  by  the  destination 
selector  in  the  argument  structure  passed  to  the  procedure  is  nil.  Upon 
completion  of  the  execution  of  the  procedure,  the  result  is  sent  to  the 
specified  destination  by  a return  actor  within  the  procedure  body. 

The  data-flow  representation  of  the  following  simple  procedure  is  shown 


A 


in  Figure  2.10: 
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Figure  2.9.  Operation  of  the  apply  and  return  actors. 
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Pr  procedure  (x) 

if  x<5  2 

then  return  x 
else  return  x 
end  P 

When  the  procedure  of  Figure  2.10  is  applied,  it  receives  on  its  input  arc 
a structure  containing  two  elements.  The  first  element,  designated  by  the 
selector  arg,  is  the  argument  x of  the  procedure.  The  second  element,  dest, 
is  the  destination  address  for  the  result.  The  procedure  shown  in  Figure 
2.10  has  been  called  with  the  argument  5 and  the  destination  A. 

The  first  operations  performed  by  the  procedure  are  select  operations 
which  send  the  argument  to  the  procedure  body  and  the  destination  address 
to  the  return  instructions.  The  procedure  body  tests  the  argument  to  see 
if  it  is  less  than  five.  If  so,  it  is  squared,  and  the  resulting  value  is 
returned.  If  the  argument  is  greater  than  or  equal  to  five,  the  original 
value  is  returned. 

In  a structured  representation  of  the  procedure,  each  node  of  the  struc- 
ture represents  one  instruction  of  the  program  and  has  selectors  to  specify 
the  operation  code,  each  operand,  any  constants,  and  the  destinations  for 
the  result.  The  destination  selectors  designate  the  nodes  of  the  program 
graph  which  are  to  receive  copies  of  the  result  generated  by  the  instruction. 
Each  destination  selector  also  specifies  the  identifier  of  the  operand  in 
the  destination  which  is  waiting  for  the  result  (i.e.  dl-1  indicates  that 
this  is  the  first  destination  specified  in  the  instruction,  and  the  result 
is  to  go  to  the  first  operand  of  the  designated  destination  instruction). 

A destination  instruction  may  be  shared  if  it  receives  more  than  one  operand. 
A structured  representation  of  the  data-flow  procedure  of  Figure  2.10  is 


not  represented  by  separate  instructions,  but  are  incorporated  into  the  in- 
structions representing  the  return  and  square  actors  as  part  of  the  operand 
specification.  The  value  within  a gated  operand  designated  by  the  selector 
gate  specifies  the  type  of  gate  represented,  and  the  selectors  control  and 
data  designate  the  control  value  and  data  value  received.  This  method  of 
representation  allows  a more  efficient  execution  of  the  instruction  within 
the  data-flow  processor  and  is  discussed  further  in  Section  3.1.1. 

Initially,  all  operands  of  the  program  structure  of  Figure  2.11  have 
value  nil  to  indicate  that  they  are  empty.  An  instruction  is  enabled  for 
execution  when  no  operands  of  the  instruction  contain  the  value  nil . and 
each  control  value  received  matches  the  associated  gate  value. 

Upon  being  enabled,  an  instruction  is  ready  to  be  processed.  Some 
arbitrary  time  later,  the  specified  operation  is  performed  on  the  operands 
of  the  instruction,  and  the  result  is  sent  to  all  instructions  indicated 
by  the  destination  address  selectors  of  the  instruction.  At  each  destina- 
tion, the  result  is  appended  to  the  correct  nil- valued  operand  (designated 
by  the  operand  number  in  the  destination  selector),  and  the  instruction 
containing  that  operand  is  enabled  if  all  operands  are  present  and  the  cor- 
rect control  values  have  been  received.  If  the  destination  operand  is  gated, 
the  type  of  value  received  determines  whether  it  is  to  be  appended  to  the 
node  designated  by  the  selector  control  or  the  one  designated  by  the  selector 
data. 


If  the  control  values  received  by  an  instruction  do  not  match  their 
associated  gate  values,  the  instruction  is  not  enabled,  and  it  and  its  sue- 
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cessors  are  not  executed  in  the  activation. 

Many  simultaneous  activations  of  a data-flow  procedure  may  exist  as  a 
result  of  concurrent  or  recursive  procedure  application.  In  order  to  avoid 
the  possibility  of  interaction  between  tokens  from  separate  activations,  a 
new  copy  of  a procedure  is  created  for  each  activation,  the  argument  struc- 
ture is  transmitted  to  the  new  copy,  and  after  a result  is  returned,  the 


copy  is  discarded. 
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Chapter  3 

ARCHITECTURE  OF  THE  DATA-FLOW  PROCESSOR 

The  data-flow  processor  described  in  this  chapter  is  designed  to  direct- 
ly  execute  programs  expressed  in  the  data-flow  language  presented  in  Chapter 
2.  The  structure  of  the  processor  is  presented  in  four  stages.  The  first 
section  of  the  chapter  discusses  the  representation  of  instructions  within 
the  processor  and  the  execution  of  individual  instructions  representing 
operators  and  deciders  of  a program.  The  next  section  extends  the  descrip- 
tion to  include  the  processing  of  structures.  The  third  section  presents 

the  multi-level  memory  structure  utilized  by  the  processor  in  which  the  mem- 

- 

ories  of  the  instruction  and  structure  processing  sections  of  the  processor 
act  as  caches  for  the  most  active  instructions  and  structure  values.  The 
final  section  describes  the  implementation  of  procedures  within  the  archi- 
tecture. 

3. 1 Instruction  Processing 

The  instructions  of  a data-flow  program  are  stored  and  executed  in  the 
instruction  processing  section  of  the  processor  (Figure  3-1).  Instructions 
awaiting  execution  are  contained  in  the  Instruction  Meaory.  Upon  becoming 
ready  for  execution,  an  instruction  enters  the  Arbitration  Network  and  is 
conveyed  by  the  Arbitration  Network  to  the  correct  Operation  or  Decision 
Unit.  The  results  of  an  operation  are  distributed  to  the  desired  destina- 
tion instructions  by  a Distribution  Network.  Similarly,  the  results  of  a 
decision  are  distributed  by  a Control  Network 
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operation 


Figure  3.1.  Organization  of  the  instruction  processing  section 
of  the  data-flow  processor. 
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3.1.1  Instruction  Representation 

The  instructions  of  a program  being  executed  are  stored  in  the  Instruc- 
tion Memory  of  the  processor.  The  Instruction  Memory  contains  a number  of 
Instruction  Cells,  each  holding  one  instruction  of  the  data-flow  program. 

Each  Instruction  Cell  consists  of  a number  of  registers,  say  five  (Figure 
3.2)  and  holds  the  instruction  in  the  specified  format  together  with  spaces 
for  receiving  its  operands.  An  Instruction  Cell  is  designated  by  an  identi- 
fier which  specifies  a path  to  that  Cell  through  the  Distribution  and  Control 
Ne tworks. 

Each  instruction  corresponds  to  an  operator,  a decider,  or  a Boolean 
operator  of  a data-flow  program.  As  in  the  structured  representation  of 
the  program,  the  gate  and  merge  actors  are  not  represented  by  separate  in- 
structions; rather,  the  function  of  the  gates  is  incorporated  into  the  in- 
structions associated  with  operators  and  deciders  in  a manner  that  will  be 
described  shortly,  and  the  function  of  the  merge  actors  is  implemented  for 
free  by  the  nature  of  the  Distribution  Network. 

The  first  register  of  an  Instruction  Cell  holds  an  instruction  which 
encodes  in  its  operation  code  the  function  to  be  performed,  that  is,  the 
type  of  actor  represented  by  the  Cell.  The  register  specifies  in  its  des- 
tination field  the  Cell  identifier  of  an  instruction  which  is  to  receive 
one  copy  of  the  result. 

Each  other  register  of  the  Cell  can  hold  either  a data  operand,  a Bool- 

' 

ean  operand  and  one  destination,  or  two  destinations.  A register  can  also 
be  empty,  indicating  that  it  is  not  used  by  the  instruction  currently  oc- 
cupying the  Cell.  The  use  of  the  register  is  Indicated  by  a use  code  in  the 
first  field  of  the  register.  If  four  data  operands  are  used  in  an  instruc- 
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data 
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use  code  < 
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contains  a data  operand 
contains  a Boolean  operand  and 
a destination 

dest  contains  two  destinations 


empty  not  used  by  this  instruction 


Figure  3.2.  Format  of  fields  in  an  Instruction  Cell. 
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tion,  only  one  destination  can  be  specified,  and  that  destination  must  be  a 
distribution  instruction  (Figure  3.3)  if  more  than  one  destination  is  desired 
for  the  result. 

A register  containing  the  components  designated  by  an  operand  selector 
in  an  instruction  consists  of  two  parts,  a gating  code  gl,  g2  and  either  a 
data  receiver  vl  or  a control  receiver  cl.  The  gating  codes  permit  repre- 
sentation of  gate  actors  that  control  the  reception  of  operand  values  by 
the  operator  or  decider  represented  by  the  Instruction  Cell.  The  meanings 
of  the  code  values  are  as  follows: 

meaning 

the  associated  operand  is  not  gated 

an  operand  value  is  accepted  by  arrival  of  a true 
control  value;  discarded  by  arrival  of  a false  value 

an  operand  value  is  accepted  by  arrival  of  a false 
control  value;  discarded  by  arrival  of  a true  value 

the  operand  is  a constant  value 
The  structure  of  a data  or  control  receiver  (Figure  3.4)  provides  space 
to  receive  a data  or  Boolean  value,  and  two  flag  fields  in  which  the  arrival 
of  data  and  control  values  is  recorded.  The  gate  flag  is  changed  from  off 
to  true  or  false  by  a true  or  false  control  value.  The  value  flag  is  changed 
from  off  to  on  by  a data  or  Boolean  value  according  to  the  type  of  receiver. 

An  initial  configuration  of  Instruction  Cells  corresponding  to  the  data- 
flow program  of  Figure  2.11  is  given  in  Figure  3.5. 

3.1.2  Operation  of  an  Instruction  Cell 

The  function  of  each  Instruction  Cell  is  to  receive  data  and  control  val- 


code value 
no 

true 


false 


const 


ues,  and,  when  the  Cell  becomes  enabled,  to  transmit  the  contents  of  the  Cell 
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to  a Functional  Unit  (either  an  Operation  Unit  or  a Decision  Unit,  determined 
by  the  actor  type).  An  Instruction  Cell  becomes  enabled  just  when  all  its 
registers  are  enabled.  A register  specified  to  act  an  an  instruction  reg- 
ister is  always  enabled.  Registers  specified  to  act  as  operand  registers 
change  state  with  the  arrival  of  values  directed  to  them.  The  state  tran- 
sitions and  enabling  rules  for  data  operand  registers  are  defined  in  Figure 
3.6. 

In  Figure  3.6  the  contents  of  an  operand  register  are  represented  as 
follows : 


no:  (off,  off)  I I empty 

t ft  t 

true:  (true.,  on)  * filled  and  enabled 


The  asterisk  indicates  that  the  register  is  enabled.  Events  denoting  the 

arrival  of  data  and  control  values  are  labelled  thus: 

d data  value 

t true  control  value 

f false  control  value 

Note  that  upon  arrival  of  a data  value  and  a control  value  that  does  not 
match  the  gating  code  of  the  register,  the  register  enters  a trap  state 
which  indicates  that  the  instruction  contained  in  the  Cell  is  not  to  be 
executed.  In  a correct  program,  if  one  register  of  a given  Cell  enters 
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no:  (off,  off)  | | (off  . on) 


true:  (off , off) 


(°ff » on) 

J ■"  ■>  (true  , off) 

V^.  (false  . off)  | | « 


(off , on) 

false:  (off,  off)  (false  . off)  | 1 


(false,  trj&mZm 
(true . on) 

(false , trap)|j 


(true . trap) 
(false , on) 


(true  , off)  1 I " " > (true  , trap) 


const:  (off , on) 


Figure  3.6.  State  transition  and  enabling  rules  for  data 
operand  registers. 
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the  trap  state,  all  will  enter  the  trap  state. 

To  illustrate  the  operation  of  an  Instruction  Cell,  let  us  examine  the 
data-flow  program  of  Figure  3.5.  Cell  0 of  the  program  is  enabled  upon 
receiving  the  identifier  of  a structure  a in  register  2.  The  entire  con- 
tents  of  the  enabled  Cell  are  transmitted  as  an  operation  packet  to  some 
distribution  Operation  Unit  (Figure  3.7).  The  Operation  Unit  merely  cre- 
ates two  data  packets , each  consisting  of  a destination  address  and  one 
result,  and  sends  each  to  the  appropriate  register  specified  in  the  des- 

■ 

tination  address  fields  of  the  instruction.  Upon  receipt  by  the  Memory  of 
the  data  packets  directed  to  Cells  1 and  2,  both  Cells  are  enabled,  and  each 
transmits  its  contents  as  an  operation  packet  to  an  Operation  Unit  capable 
of  performing  the  select  operation.  - * 

Cell  4 represents  a decider  which,  when  enabled,  transmits  its  contents 
as  a decision  packet  to  a Decision  Unit.  The  Decision  Unit,  upon  receiving 
a decision  packet,  performs  the  specified  test  and  sends  the  resulting  con- 
trol packets  to  the  designated  Cells.  j 

3.1.3  Network  Structure 

To  connect  the  Instruction  Cells  of  the  Memory  to  the  Operation  and  De- 
cision Units,  a network,  called  the  Arbitration  Network,  provides  a path  from 
each  Instruction  Cell  to  each  Operation  or  Decision  Unit.  Operation  and  de- 
cision packets  are  transmitted  from  Instruction  Cells  into  the  Arbitration 
Network.  The  network  is  capable  of  accepting  many  packets  simultaneously 
and  delivers  each  packet  to  the  correct  Functional  Unit. 

Upon  receiving  an  operation  packet,  an  Operation  Unit  performs  the  func- 


A 
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tion  specified  by  the  operation  code  on  the  operands  of  the  packet  and  pro- 
duces a data  packet  for  each  destination  specified  in  the  instruction.  A 
Distribution  Network  concurrently  accepts  data  packets  from  the  Operation 
Units  and,  using  the  destination  address  of  each  packet,  delivers  it  to  the 
specified  Instruction  Cell.  Similarly,  the  control  packets  produced  by  a 
Decision  Unit  are  sent  to  the  Control  Network  for  delivery  to  the  designated 
Instruction  Cells. 

A simplified  structure  of  the  Arbitration  and  Distribution  Networks  is 
presented  in  Figure  3.8.  The  networks  are  composed  of  three  types  of  units. 
An  arbitration  unit  passes  packets  arriving  at  its  input  ports  one-at-a- 
time  to  its  output  port,  using  a round-robin  discipline  to  resolve  any  con- 
flicts. A switch  unit  passes  a packet  at  its  input  to  one  of  its  outputs, 
controlled  by  some  property  of  the  packet.  In  the  Arbitration  Network  this 
property  is  the  operation  code,  whereas  in  the  Distribution  Network,  the 
switch  units  are  controlled  by  the  destination  address.  A buffer  unit  stores 
a packet  until  the  succeeding  switch  or  arbitration  unit  is  ready  to  accept 
it. 

Due  to  the  large  number  of  inputs  to  the  Arbitration  Network,  we  wish 
to  transfer  data  between  the  Memory  Cells  and  the  Arbitration  Network  in 
serial  format  to  reduce  the  number  of  wires  necessary.  However,  in  order 
to  maintain  a high  rate  of  packet  flow  at  the  output  ports,  we  wish  to  trans- 
fer packets  to  the  Functional  Units  in  parallel  format.  For  this  reason, 
serial-to-parallel  conversion  is  done  gradually  within  the  buffer  units  as 
a packet  travels  through  the  Arbitration  Network.  Parallel- to- serial  con- 
version is  performed  in  the  Distribution  Network  for  similar  reasons. 
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(b)  Distribution  Network 


Figure  3.8. 


Structure  of  the  Arbitration  and 
Distribution  Networks. 
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The  structure  of  the  Control  Network  is  similar  to  that  of  the  Distri- 
bution Network.  However,  all  packets  within  the  Control  Network  contain 
simple  Boolean  values,  and  hence  the  parallel-to-serial  conversion  and  buf- 
fering of  the  Distribution  Network  is  not  needed,  and  the  Control  Network 
is  composed  of  only  switch  and  arbitration  units. 

3.2  Structure  Hand  ling 

The  physical  representation  of  a structure  within  a computer  system  may 
be  viewed  in  several  different  ways.  One  extreme  involves  implementing  the 
structure  as  it  is  represented  in  the  data-flow  model,  that  is,  as  an  acyclic 
directed  graph  in  which  each  node  is  either  a structure  node  or  an  elementary 
node.  In  such  an  implementation,  each  node  of  the  graph  occupies  a number 
of  storage  locations  within  the  processor.  The  location(s)  containing  a 
structure  node  hold  the  identifiers  of  the  locations  containing  nodes  which 
are  successors  of  that  node.  The  location  representing  an  elementary  node 
holds  an  elementary  value.  The  nodes  of  a structure  represented  in  this 
fashion  msv  be  scattered  throughout  the  memory  of  the  processor. 

Alternatively,  all  elementary  values  of  a structure  may  be  stored  to- 
gether in  a group  of  locations.  The  first  few  locations  of  the  group  then 
contain  a mapping  function  which  allows  one  to  find  the  location  of  a spe- 
cific element  within  the  group.  This  method  is  often  used  for  the  repre- 
» 

sentation  of  arrays  within  a conventional  computer  system. 

The  first  approach  has  the  problem  that  the  storage  of  a structure  in 
such  a manner  can  occupy  a great  deal  of  space  within  the  memory.  Not  only 
must  the  data  be  stored,  but  a large  number  of  structure  nodes  and  associated 
pointers  must  also  be  located  within  the  memory.  Accessing  an  elementary 
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value  in  a graph  can  cake  a long  time  as  a path  is  followed  over  the  arcs 
of  the  graph  to  the  desired  node.  On  the  other  hand,  a single  structure 
represented  by  the  second  approach  occupies  much  less  room,  but  the  repre- 
sentation of  several  structures  in  such  a manner  can  be  very  expensive  in 
terms  of  space  since  components  of  a structure  cannot  be  shared  as  they  can 
in  the  graph  approach. 

It  would  seem  that  perhaps  a combination  of  these  two  methods  could  be 
efficiently  utilized;  that  is,  a structure  representation  in  which  each  node 
of  the  structure  is  a small  block  of  data.  In  the  remainder  of  this  section 
we  present  an  approach  in  which  data  structures,  described  as  acyclic  di- 
rected graphs,  are  implemented  in  such  a hybrid  manner. 

3.2.1  Simple  Structures 

The  storage  of  structures  and  the  execution  of  the  structure  actors  de- 
scribed in  Section  2.2  occurs  in  a separate  structure  processing  section 
within  the  data-flow  processor.  The  structure  processing  section  consists 
of  a Structure  Operation  Unit  and  a Structure  Memory  and  attendant  Arbitra- 
tion and  Distribution  Networks.  This  section  of  the  processor  is  viewed  as 
an  Operation  Unit  by  the  Instruction  Memory;  that  Is,  packets  specifying 
structure  operations  are  sent  to  the  section,  and  data  packets  are  returned. 
The  organization  of  the  data-flow  processor  with  the  addition  of  the  struc- 
ture processing  capability  is  shown  in  Figure  3.9. 

Packets  specifying  structure  operations  are  received  by  the  Structure 
Memory  and  the  Structure  Operation  Unit.  Instructions  which  require  the 
creation  of  new  structure  nodes  are  processed  by  the  Structure  Operation 
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Figure  3.9.  Organization  of  the  data-flow  processor  without 
multi-level  memory. 
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Unit.  The  unit  controls  the  performance  of  the  instruction  specified  in  each 
operation  packet  through  instruction  packets  sent  to  the  Structure  Memory  and 
sends  as  data  packets  the  identifiers  of  the  resulting  structures  to  the  in- 
struction processing  section.  All  structure  operations  other  than  the  allo- 
cation of  a new  node  are  performed  within  the  Structure  Memory. 

To  illustrate  the  operation  of  the  structure  processing  section  of  the 
processor,  in  this  section  we  shall  limit  our  consideration  to  structures 
represented  as  binary  trees.  A selector  of  such  a structure  can  have  one 
of  two  values,  L (left)  and  R (right).  Such  structures  are  well-known  from 
their  use  in  Lisp  [21]. 

A node  of  a structure  is  contained  in  a two  register  Cell  known  as  a 
Structure  Cell  and  designated  by  a Cell  identifier.  The  two  registers  of 
the  Cell  contain  the  left  and  right  components  of  the  structure,  respec- 
tively; and  hence  no  selector  need  be  stored  in  a register.  The  first 
field  of  a register  is  a use  code  which  indicates  whether  the  item  stored 
in  the  second  field  is  the  identifier  of  another  Cell  or  an  elementary  val- 
ue or  the  register  is  empty.  A memory  representation  of  the  simple  struc- 
ture of  Figure  2.4  is  presented  in  Figure  3.10. 

The  Structure  Memory  is  composed  of  a number  of  Structure  Cells  in  a 
manner  similar  to  the  way  the  Instruction  Memory  is  formed  of  a number  of 
Instruction  Cells.  Each  Structure  Cell  is  capable  of  holding  one  node  of  a 
structure,  and  the  identifier  of  the  Cell  specifies  a path  through  the  Dis- 
tribution Network  to  the  Cell.  The  Structure  Memory  receives  instruction 
packets  from  the  Instruction  Memory  and  the  Structure  Operation  Unit  command- 
ing a specific  Structure  Cell  to  execute  some  structure  operation  upon  the 


node  located  in  the  Cell. 
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Each  Structure  Cell  within  the  Structure  Memory  is  capable  of  perform- 
ing one  of  two  operations  upon  the  structure  node  contained  in  the  Cell. 

The  possible  operations  are: 

1.  select.  Upon  receipt  of  an  instruction  packet  specifying  a select 
operation 


{ 


select  destl 

s J 


a Structure  Cell  follows  one  of  two  procedures,  controlled  by 
whether  s is  a simple  or  compound  selector. 

a.  If  s is  a simple  selector,  the  content  c of  the  register  desig- 
nated by  8 is  used  to  form  a data  packet 


M 


which  is  presented  to  the  Arbitration  Network  for  transmission 

to  the  instruction  processing  section  of  the  processor. 

If  s is  a compound  selector  s.s  •••s  , the  content  0 of  the 

i i.  n 

register  designated  by  s^  is  the  identifier  of  some  other 
Structure  Cell  and  is  used  to  form  the  instruction  packet 


P 

< select  dest  ► 
. s2  * * " sn 


which  is  presented  to  the  Arbitration  Network  for  transmission 
to  the  input  Distribution  Network  of  the  Structure  Memory. 

The  process  is  then  repeated  with  the  selector  S£  at  Struc- 
ture Cell  p. 

2.  alter.  The  receipt  of  an  alter  instruction 
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r 

! , 


alter 

- s L 

X 

. p . 

indicates  that  the  Structure  Cell  is  to  contain  a copy  of  the  node 
P with  the  component  of  P designated  by  the  selector  s set  to  x. 
First,  a copy  of  node  p is  retrieved  from  the  auxiliary  memory 
through  use  of  a retrieve  command  packet  in  the  manner  to  be  de- 
scribed in  Section  3.3.3.  Once  the  copy  of  P is  present  in  the 
Cell,  the  value  contained  in  the  register  designated  by  the  selec- 
tor s is  changed  to  x,  and  the  use  code  of  the  register  is  set  to 
the  appropriate  value  (elem.  struc , or  empty)  , designated  by  the 
tag  of  x.  If  P is  0,  no  node  is  requested  from  the  auxiliary 
memory;  the  current  contents  of  the  Structure  Cell  are  altered  in 
the  manner  described. 

The  format  of  an  instruction  packet  received  at  the  input  Distribution 
Network  of  the  Structure  Memory  differs  from  the  format  of  an  operation 
packet  transmitted  to  a Functional  Unit  or  the  Structure  Operation  Unit  due 
to  the  fact  that  the  operation  code  of  an  instruction  packet  does  not  con- 
trol the  switching  within  the  Distribution  Network;  rather,  the  Cell  identi- 
fier is  used  to  direct  an  instruction  packet  toward  the  correct  Structure 
Cell.  Hence,  an  instruction  packet  in  the  Distribution  Network  has  the  fol- 
lowing format 

{?} 


A 


where  a is  the  identifier  of  some  Structure  Cell  in  the  Structure  Memory 
and  i specifies  one  of  the  two  operations  which  can  be  performed  by  a Struc- 
ture Cell  and  contains  the  necessary  operands. 
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Packets  containing  instructions  that  designate  structure  operations 
are  transmitted  to  the  structure  processing  section  of  the  processor  from 
the  Instruction  Memory.  A packet  specifying  a select  instruction  is  trans- 
mitted directly  to  the  Structure  Memory  as  an  instruction  packet.  Struc- 
ture operation  packets  representing  the  other  structure  instructions  are 
transmitted  to  the  Structure  Operation  Unit.  The  necessity  of  processing 
each  operation  packet  within  the  Structure  Operation  Unit  is  due  to  the 
required  allocation  of  one  or  more  free  Structure  Cells  for  the  execution 
of  each  instruction  with  the  exception  of  the  select  instruction.  The 
Structure  Operation  Unit  performs  the  allocation  of  a free  Cell  simply  by 
accepting  the  identifier  of  a Cell  over  the  unid  port . The  manner  in  which 
these  identifiers  are  provided  and  Structure  Cells  are  freed  for  use  by 
new  structures  is  described  in  Section  3.3.2. 

Now  that  we  have  considered  the  operation  of  a Structure  Cell  within 
the  Structure  Memory,  we  can  describe  the  execution  of  each  of  the  remain- 
ing structure  actors  of  Section  2.2  merely  by  listing  the  procedure  followed 
by  the  Structure  Operation  Unit  in  processing  the  instruction.  For  the  pur- 
poses of  this  discussion,  it  is  assumed  that  all  selectors  are  simple  selec- 
tors. 

A construct  instruction 

{construct  dest  | 
si:  a f 

s2:  y J 

specifies  that  a new  node  is  to  be  created  with  components  a and  y»  desig- 
nated by  the  selectors  si  and  s2.  The  instruction  is  implemented  by  the 
Structure  Operation  Unit  as  a number  of  alter  operations  in  the  following 


manner : 
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1.  Accept  an  identifier  0 from  the  unid  port. 

2.  Transmit  to  the  Structure  Memory  the  instruction  packets 


P 

alter 

si 

a 

0 


P 

alter 

s2 

Y 

0 


transferring  the  values  a and  y to  the  correct  registers  of  0. 

3.  Transmit  to  the  instruction  processing  section  the  data  packet: 


1*  destl 

1 P J 


An  operation  packet  containing  an  append  instruction  is  of  the  following 


format 


append 

- 6 

x 

a 


where  s is  the  selector  of  the  element  in  Structure  Cell  a which  is  to  be 
replaced  by  x in  the  new  structure.  The  procedure  followed  by  the  Structure 
Operation  Unit  in  execution  of  the  instruction  is  as  follows: 

1.  Accept  an  identifier  p from  the  unid  port. 

2.  Transmit  the  instruction  packet 

r p i 

| alter  I 


to  the  Structure  Memory  to  copy  node  a into  Cell  0 and  change  the 
component  of  0 designated  by  the  selector  s to  x. 

3.  Transmit  to  the  instruction  processing  section  the  data  packet: 


J"  destl 

l P J 
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An  operation  packet  specifying  a delete  instruction 

^ delete  dest 


e dest  I 

'o  ; 


is  processed  in  a similar  manner: 

1.  Accept  an' identifier  P from  the  unid  port. 

2.  Transmit  the  instruction  packet 

P 

alter 

* S ► 

0 

a 

to  the  Structure  Memory,  indicating  that  the  use  code  of  the  regis- 
ter designated  by  s in  Cell  p is  to  be  set  to  empty. 

I! 

3.  Transmit  the  data  packet 

{dest  1 

P J 

to  the  instruction  processing  section. 

3.2.2  Extension  to  More  Complex  Structures 

The  extension  of  the  described  techniques  for  the  implementation  of 
data  structures  to  larger  and  more  complex  structures  is  straightforward. 

In  order  to  implement  structures  with  a fixed  maximum  number  of  arcs  eman- 
ating from  each  node,  the  size  of  a Structure  Cell  is  increased  to  accom- 
modate the  new  node  size.  The  use  of  arbitrary  (to  a fixed  maximum  size) 
integers  or  character  strings  as  selectors  can  be  accommodated  through  the 
addition  of  a selector  field  to  each  register.  A Structure  Cell  must  then 
have  the  capability  to  choose  from  the  node  contained  in  the  Cell  an  item 
whose  selector  matches  a specified  selector.  These  extensions  allow  the 
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representation  of  fairly  powerful  structures,  including  the  data-flow  pro- 
cedures presented  in  Section  3.4.  A further  extension  to  allow  a node  to 
have  arbitrary  number  of  emanating  arcs  introduces  a great  deal  of  complex- 
ity since  it  might  be  necessary  to  use  several  Cells  to  hold  the  identifiers 
of  all  Cells  which  contain  successors  of  the  node.  To  avoid  this  complex- 
ity, a node  of  a structure  in  the  data-flow  processor  is  of  fixed  size,  and 
each  arc  emanating  from  the  node  has  a fixed  size  selector  associated  with 
it. 

3.3  Multi-Level  Memory  Structure 

Each  node  of  a data-flow  procedure  will  fire  at  most  once  during  exe- 
cution of  the  procedure  within  the  data-flow  processor.  A large  number  of 
nodes  of  the  procedure  will  not  fire  at  all  if  any  conditionals  are  present 
in  the  program.  Thus,  it  would  be  wasteful  to  assign  an  Instruction  Cell  to 
each  instruction  of  a procedure  when  the  procedure  is  activated.  For  this 
reason  the  instruction  processing  section  of  the  data-flow  processor  incor- 
porates a multi-level  memory  system  such  that  only  the  active  instructions 
of  a program  occupy  the  Instruction  Cells  of  the  processor.  Similarly,  in 
order  to  assure  maximum  use  of  the  Structure  Cells  of  the  processor,  the 
structure  processing  section  utilizes  a multi-level  memory  to  insure  that 
only  active  structure  nodes  occupy  the  Structure  Cells.  Individual  instruc- 
tions and  structure  nodes  are  retrieved  from  their  respective  auxiliary  mem- 
ories as  they  become  required  for  computation.  Instructions  are  returned  to 
the  auxiliary  memory  only  when  the  Instruction  Cells  holding  them  are  re- 
quired for  more  active  parts  of  the  program.  Structure  nodes  are  sent  to 
the  auxiliary  memory  upon  creation  through  execution  of  an  append . delete  . 
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or  construct  instruction. 

The  structure  of  the  instruction  processing  section  of  the  data-flow 
processor  with  the  addition  of  a multi-level  memory  system  is  shown  in  Fig- 
ure 3.11.  Data-flow  programs  are  located  within  the  Packet  Memory  System, 
and  each  instruction  of  a program  is  sent  to  the  Instruction  Memory  as  a 
retrieve  packet  when  it  is  needed  for  program  execution,  designated  by  the 
arrival  of  an  operand  of  the  instruction  at  some  Instruction  Cell.  Instruc- 
tions which  occupy  Instruction  Cells  needed  for  more  active  instructions  are 
returned  to  the  Packet  Memory  System  as  store  packets.  Each  instruction  so 
discarded  is  later  retrieved  from  the  Packet  Memory  System  upon  arrival  at 
the  Instruction  Memory  of  another  of  its  operands.  The  Memory  Command  Net- 
work provides  the  correct  sequencing  of  store  and  retrieve  commands  trans- 
mitted to  the  Packet  Memory  System. 

The  multi-level  memory  system  of  the  structure  processing  section  of 
the  processor  (Figure  3.12)  operates  in  a manner  similar  to  that  of  the 
multi-level  memory  in  the  instruction  processing  section.  A Structure  Cell 
is  transferred  to  the  Structure  Memory  from  the  Packet  Memory  System  when 
required  for  execution  of  an  instruction.  A newly  created  structure  node 
is  returned  to  the  Packet  Memory  System  after  the  execution  of  an  append  , 
delete,  or  construct  instruction.  The  identifiers  of  free  nodes  are  main- 
tained within  the  Packet  Memory  System  and  are  requested  by. the  Structure 
Operation  Unit  at  the  getid  port.  Upon  receiving  such  a request,  the  Pack- 
et Memory  System  returns  the  identifier  of  a free  node  to  the  Operation  Unit 
at  its  unid  port.  The  new  Identifier  is  used  for  the  creation  of  a new 
structure  node  during  execution  of  an  instruction  specified  in  a structure 


operation  packet. 
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Figure  3.11.  Organization  of  the  instruction  processing 
section  without  multi-level  memory. 
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Figure  3.12.  Organization  of  the  structure  processing 
section  with  multi-level  memory. 


t 
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The  addition  of  a multi-level  memory  system  to  each  section  of  the 
processor  does  not  affect  the  manner  in  which  instructions  are  processed 
within  the  machine.  The  operation  of  Instruction  Cells  and  Structure  Cells 
proceeds  in  the  manner  described  in  Sections  3.1  and  3.2.  The  multi-level 
memory  structure  merely  assures  that  the  individual  Instruction  Cells  and 
Structure  Cells  are  more  highly  utilized  during  execution  of  a program. 


3.3.1  Specification  of  the  Memory  System 

Ihe  techniques  of  conveying  information  through  a computer  system  in 
fixed  size  packets  and  of  breaking  the  system  into  a large  number  of  iden- 
tical units  to  exploit  the  achievable  concurrency  can  also  be  applied  to  the 
structure  of  a computer  system  memory.  Such  a structure,  known  as  a Packet 
Memory  System,  has  been  proposed  by  Dennis  [11]- 

The  Packet  Memory  System  shown  in  Figure  3.13  contains  programs  and  data 
represented  as  acyclic  directed  graphs  in  the  manner  described  in  Sections 
2.2  and  2.3.  The  restriction  to  acyclic  graphs  permits  the  reference  count 
technique  of  memory  management  to  be  used,  rather  than  a general  garbage 
collector.  Each  node  of  a graph  has  an  associated  reference  count  which  is 
equal  to  the  number  of  arcs  terminating  on  the  node  plus  the  number  of  ref- 
erences to  the  node  existing  in  the  processor.  When  a node  becomes  inacces- 
sible due  to  the  execution  of  some  instruction  of  the  program,  the  reference 
count  of  the  node  will  become  0. 

The  directed  graphs  are  represented  within  the  Packet  Memory  System 
through  a collection  of  items  I,  each  of  which  is  designated  by  an  element 
from  a set  of  unique  identifiers  U.  An  item  can  be  one  of  three  types: 

1.  empty.  The  item  contains  the  value  nil.  However,  the  item  may  or 


i 
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may  not  be  accessible. 


2.  elementary  value, 
where : 


3.  structure  value. 


where : 


u:  felem,  e,  r} 

u € U unique  identifier  of  the  item  (unid) 
e 6 E elementary  value 
r € I reference  count 

u0:  f struc , u^  ....  uq,  r} 

V U1 Un6  U 

r € I 


The  unique  identifiers  of  inaccessible  items  are  maintained  in  a collection 
T of  free  unique  identifiers. 

The  state  of  the  Packet  Memory  System  is  represented  by  the  pair  {i-T, 
T) . The  initial  state  is  [0,  U} , all  items  of  the  Memory  are  empty,  and 
their  identifiers  are  free. 

The  operation  of  the  Packet  Memory  System  is  described  by  considering 
the  four  types  of  transactions  performed  by  the  Memory  System: 


1.  Store  transaction. 


A store  packet  { i , elem , e , 1}  or 

{i,  struc , u^,  . u , 1} 

presented  at  the  store  port  of  the  Packet  Memory  System  requests 

storage  of  the  item  with  unid  i: 

f elem , e,  1}  or 
f struc , u^,  ...,  ur>  1} 

However,  the  storage  of  the  item  is  not  effective  until  a store 


command  packet 


{i,  store! 


is  received  by  the  Packet  Memory  System  at  its  command  port . and 
any  prior  retrieval  requests  have  taken  affect. 
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2.  Retrieval  transaction. 

An  item  with  unid  i is  retrieved  from  the  Packet  Memory  System  by 

means  or  a retrieve  command  packet 

{ i , retrt  or 
{i,  retr,  j] 

presented  at  the  command  port  of  the  Memory.  A retrieve  packet 
conveying  the  contents  of  the  item  identified  by  i is  eventually 
delivered  at  the  retrieve  port  of  the  Memory.  In  the  first  case, 
the  unid  i controls  delivery  of  the  retrieve  packet,  in  the  second, 
the  identifier  j specifies  the  destination  for  the  packet. 

3.  Reference  count  transaction. 

The  incrementing  or  decrementing  of  the  reference  count  of  an  item 

with  unid  i is  accomplished  by  an 

{i,  up}  or 
{i,  dwnt 

command  packet  delivered  at  the  command  port.  If  the  new  reference 
count  of  the  item  identified  by  i is  0: 

a.  If  the  item  is  a structure 

i:  fstruc,  u, , ...»  u ,0} 

i n 

the  command  packets 

{u1>  dwn} , •••,  {un,  dwn} 
appear  at  the  reference  count  (rc)  port. 

b.  Unid  i is  added  to  T. 


4.  Unique  identifier  generation. 

A free  unique  identifier  is  requested  by  means  of  a command  packet 


^ 
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{ra,  get id] 

presented  at  the  getid  port , where  m designates  the  unit  requesting 
the  identifier.  A unid  i is  removed  from  T and  appears  in  a unid 
packet 

{m,  i} 

at  the  unid  port  of  the  Memory  System.  This  transaction  can  only 
occur  if  T 0. 

The  Packet  Memory  System  may  be  organized  as  a set  of  smaller  Packet 
Memory  System  modules  among  which  packets  are  distributed  by  a Distribution 
Network  and  formed  into  common  streams  by  an  Arbitration  Network.  Such  an 
organization  permits  the  system  to  handle  large  numbers  of  storage  and  re- 
trieval requests  concurrently  and  is  discussed  further  in  [11]. 

To  insure  the  proper  maintanence  of  items  in  the  Packet  Memory  System, 
the  Structure  Memory  and  Instruction  Memory  must  send  the  appropriate  dwn 
and  u£  command  packets  to  the  Memory  System  of  the  structure  processing 
section  as  references  to  items  are  deleted  and  created  during  instruction 
execution.  A reference  to  an  item  is  deleted  by  the  execution  in  the  Struc- 
ture Memory  of  any  instruction  representing  a structure  actor.  A reference 
to  an  item  is  created  in  the  Structure  Memory  by  execution  of  a select  in- 
struction if  the  selected  item  is  a structure  value.  A reference  to  an  item 
can  be  created  in  the  Instruction  Memory  through  execution  of  a data  distri- 
bution instruction  containing  a unid  as  its  operand.  We  must  require  that 
upon  the  enabling  of  such  an  instruction,  the  Instruction  Cell  containing 
the  instruction  provide  a command  packet  of  the  form  {unid,  up]  for  each 
destination  in  excess  of  one. 

J 
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3.3.2  Organization  of  the  Active  Memories 

The  use  of  a multi-level  memory  system  within  each  section  of  the  data- 
flow processor  requires  that  the  Instruction  Memory  and  Structure  Memory  act 
as  caches  for  the  most  active  instructions  and  structure  nodes.  For  appli- 
cation of  the  cache  principle  to  the  architecture,  the  Instruction  and  Struc- 
ture Cells  of  the  processor  are  organized  into  groups  of  Cells,  known  as  Cell 
Blocks. 

A packet  destined  for  the  Instruction  Memory  or  Structure  Memory  can  no 
longer  identify  its  destination  by  use  of  a Cell  identifier.  Each  packet  in 
the  processor  contains  a node  identifier  which  specifies  a destination  in  the 
processor.  A node  identifier  specifying  a destination  in  either  the  Structure 
Memory  or  the  Instruction  Memory  is  the  unique  identifier  of  the  node  to  which 
the  packet  is  destined.  The  identifier  is  divided  into  two  parts,  a major 
address  and  a minor  address . each  containing  a portion  of  the  identifier.  One 
Cell  Block  of  each  section  of  the  processor  is  associated  with  each  possible 
major  address.  However,  the  two  memories  may  be  of  different  size  and  hence 
use  different  sizes  of  major  address. 

All  instructions  having  the  same  major  address  are  processed  by  the  In- 
struction Cells  of  the  corresponding  Instruction  Cell  Block.  Thus,  the  Dis- 
tribution and  Control  Networks  use  the  major  address  to  direct  data  packets, 
control  packets,  and  retrieve  packets  to  the  appropriate  Instruction  Cell 
Blocks.  Similarly,  all  structure  nodes  with  the  same  major  address  are  pro- 
cessed within  the  same  Structure  Cell  Block,  and  the  major  address  serves  to 
direct  retrieve  and  instruction  packets  to  the  correct  Cell  Block.  The  packet 
delivered  to  a Cell  Block  includes  the  minor  address,  which  serves  an  an 
identifier  for  that  packet  within  the  Cell  Block. 
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Figure  3.14.  Organization  of  a Structure  Cell  Block. 
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Only  Cells  in  occupied  status  are  candidates  for  displacement. 

Whenever  a node  in  the  Structure  Memory  is  created  through  execution 
of  a construct . append . or  delete  instruction,  a store  packet  conveying 
the  contents  of  the  node  and  a store  command  packet  are  sent  to  the  Packet 
Memory  System.  Hence,  when  an  item  in  the  Structure  Memory  is  chosen  for 
preemption,  it  is  not  necessary  to  return  it  to  the  Packet  Memory  System. 

Operation  of  a Structure  Cell  Block  can  be  specified  by  giving  two 
procedures  — one  initiated  by  arrival  of  an  instruction  packet  from  the 
Instruction  Memory,  the  Structure  Operation  Unit,  or  another  Cell  of  the 
Structure  Memory,  the  other  activated  by  arrival  of  a retrieve  packet  from 
the  Packet  Memory  System. 

Procedure  1:  Arrival  of  an  instruction  packet 

r n 

^ op.  dest 

L w 

where  n designates  the  minor  address  of  the  node  to  be  operated 


upon 

op  designates  the  operation  to  be  performed:  select  or  alter 
dest  provides  a node  identifier  for  the  value  resulting  from 
a select  operation 

w consists  of  the  necessary  operands  for  the  performance 
of  the  operation  op 

step  1.  If  the  Association  Table  does  not  have  an  entry  with  minor  address 
n,  go  to  step  3.  If  there  is  an  entry  with  minor  address  n,  let 
p be  the  Cell  corresponding  to  the  entry;  continue  with  step  2. 
step  2.  If  0£  specifies  an  alter  operation  and  the  final  operand  in  w is 
not  0,  go  to  step  6;  otherwise  go  to  step  7. 


. 


step  3. 


step  4. 
step  5. 


step  7. 


Procedure 


step  1. 


step  4. 


If  the  Association  Table  shows  that  no  Structure  Cell  has  status 
free , go  to  step  4.  Otherwise  let  p be  a Cell  with  status  free ; 
go  to  step  5. 

Use  the  Stack  to  choose  a Cell  p in  occupied  status  for  preemption 
If  o£  specifies  an  alter  operation  and  the  final  operand  in  w is 
0,  set  the  status  of  Cell  p to  {n,  occupied!  ; go  to  step  7.  Other- 
wise continue  with  step  6. 

Change  the  entry  in  the  Association  Table  for  Cell  p to  {n,  engagec 
Transmit  the  appropriate  retrieve  command  packet  to  the  Packet  Mem 
ory  System  via  the  Memory  Command  Network. 

Transmit  the  contents  of  the  instruction  packet  to  the  Cell  p.  If 
the  status  of  the  Cell  is  engaged . await  the  arrival  of  a retrieve 
packet;  otherwise  continue  with  step  8. 

If  Cell  p is  occupied . perform  the  structure  operation  specified 
by  o£  on  the  contents  of  Cell  p.  Change  the  order  of  Cells  in  the 
Stack  to  make  Cell  p the  last  candidate  for  displacement. 

_ 2:  Arrival  of  a retrieve  packet  {n,  x}  with  minor  address  n and 
content  x. 

Let  p be  the  Structure  Cell  with  entry  {n,  engaged!  in  the  Associa- 
tion Table. 

Transmit  the  contents  of  the  retrieve  packet  to  the  Cell  p. 

Change  the  Association  Table  entry  for  Cell  p from  [n,  engaged} 
to  {n,  occupied! . 

Perform  the  operation  specified  by  the  instruction  present  at  Cell 


p upon  the  new  contents  of  the  Cell.  Change  the  order  of  Cells  in 
the  Stack  to  make  Cell  p the  last  candidate  for  displacement. 
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3.3.4  Operation  of  an  Instruction  Cell  B lock 

The  composition  of  an  Instruction  Cell  Block  in  the  Instruction  Memory 
i * 

is  similar  to  that  of  a Structure  Cell  Block,  consisting  of  a number  of  In- 
struction Cells,  an  Association  Table,  and  a Stack  (Figure  3.15).  The  In- 
struction Cell  Block  acts  as  a cache  in  a manner  similar  to  the  operation  of 

a Structure  Cell  Block.  However,  an  instruction  in  an  Instruction  Cell  Block 

I \ 

is  not  retained  in  a Cell  after  being  sent  to  a Functional  Unit;  rather,  the 
Cell  is  freed  for  use  by  another  instruction.  Also,  an  item  in  an  Instruc- 
tion Cell  Block  is  returned  to  the  Packet  Memory  System  only  if  the  Cell  it 
occupies  is  needed  for  a more  active  instruction. 

The  Association  Table  of  an  Instruction  Cell  Block  must  have  greater 
capability  than  one  in  a Structure  Cell  Block  in  order  to  keep  track  of  in- 
structions which  have  been  returned  to  the  Packet  Memory  System.  The  pos- 
sible values  of  the  status  indicator  in  the  Association  Table  of  an  Instruc- 
tion Cell  Block  are: 

free 
engaged 
occupied 
absent 

The  first  three  status  values  have  the  same  meaning  as  the  corresponding 
values  in  a Structure  Cell  Block.  The  status  absent  is  used  to  indicate 
that  the  associated  instruction  has  been  displaced  from  the  Cell  Block  to 

4 

the  Packet  Memory  System. 

Since  an  Association  Table  in  an  Instruction  Cell  Block  can  contain  an 
undetermined  number  of  items  with  status  absent , the  structure  of  the  Asso- 


ciation Table  must  be  extended  beyond  that  utilized  in  a Structure  Cell 
Block.  The  Association  Table  of  an  Instruction  Cell  Block  is  divided  into  two 


Figure  3.15.  Organization  of  an  Instruction  Cell  Block. 
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sections  (Figure  3.16).  The  first  section  corresponds  to  the  Association 
Table  of  a Structure  Cell  Block  and  contains  an  entry  for  each  Instruction 
Cell,  giving  its  status.  The  second  section  holds  one  node  of  the  absent 
list.  This  list  contains  the  identifiers  of  all  instructions  with  status 
absent , those  which  were  displaced  from  the  Cell  Block  before  receiving  all 
operands  necessary  for  their  enabling.  Each  entry  in  the  absent  list  is 
of  the  form  [m,  i}  , where  m is  the  minor  address  of  the  instruction  and  i 
is  the  unid  within  the  Packet  Memory  System  to  which  it  has  been  displaced. 

If  more  absent  items  are  associated  with  a Cell  Block  than  can  be  ac- 
commodated in  the  Association  Table,  the  absent  list  is  continued  in  the 
Packet  Memory  System.  If  the  absent  list  of  a given  Association  Table  has 
overflowed  into  the  Packet  Memory  System,  it  may  be  necessary  to  retrieve 
the  nodes  of  the  list  from  the  Packet  Memory  System  to  discover  if  a given 
instruction  has  status  absent.  The  last  entry  in  the  Association  Table  of 
an  Instruction  Cell  Block  is  the  unique  identifier  in  the  Packet  Memory  Sys- 
tem of  the  first  node  of  the  portion  of  the  absent  list  located  in  the  Mem- 
ory System.  If  the  value  of  the  last  entry  is  nil . the  absent  list  is  com- 
pletely contained  in  the  Association  Table. 

The  number  of  Instruction  Cells  in  an  Instruction  Cell  Block  and  the 
size  of  the  Association  Table  must  be  properly  chosen  in  a data-flow  pro- 
cessor so  that  only  a small  number  of  instructions  associated  with  a Cell 
Block  have  status  absent  at  any  one  time.  This  insures  that  the  absent  list 
for  a Cell  Block  is,  in  most  cases,  completely  contained  in  the  Association 
Table  of  that  Cell  Block. 

A data  or  control  packet  received  by  an  Instruction  Cell  Block  may  not 


be  completely  processed  at  one  time  if  its  destination  instruction  is  not 
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Association  Table 


Figure  3.16.  Structure  of  the  Association  Table  for 
an  Instruction  Cell  Block. 
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listed  in  the  Association  Table  of  the  Cell  Block  and  the  absent  list  is 
not  completely  contained  in  the  Association  Table.  If  such  is  the  case,  the 
processing  of  the  packet  will  be  interrupted  to  retrieve  nodes  of  the  absent 
list  from  the  Packet  Memory  System,  and  a number  of  retrieve  packets  convey- 
ing instructions  may  be  processed  before  the  requested  node  of  the  absent 
list  arrives. 

The  minor  address  of  a retrieve  packet  conveying  a node  of  the  absent 
list  is  0,  distinguishing  the  packet  from  retrieve  packets  conveying  instruc- 
tions. 

Once  again,  operation  of  the  Cell  Block  is  specified  by  a number  of  pro- 
cedures. Procedures  3 and  4 maintain  and  search  the  absent  list  and  are  used 
when  necessary  by  Procedures  1 and  2.  Each  of  the  first  two  procedures  is 
activated  by  the  arrival  of  a packet  at  the  Cell  Block  — one  by  the  arrival 
of  a data  or  control  packet,  and  the  other  by  arrival  of  a retrieve  packet 
from  the  Packet  Memory  System. 

Procedure  1:  Arrival  of  a data  or  control  packet  {n,  y}  where  n is  a minor 
address,  and  y is  the  packet  content. 

step  1.  Does  the  first  section  of  the  Association  Table  have  an  entry  with 

minor  address  n.  If  so,  let  p be  the  Instruction  Cell  corresponding 
to  the  entry  and  go  to  step  5.  Otherwise  continue  with  step  2. 

step  2.  If  the  Association  Table  shows  that  no  Instruction  Cell  has  status 

free , go  to  step  3.  Otherwise  let  p be  a Cell  with  status  free. 

Let  the  Association  Table  entry  for  p be  {.,  free);  go  to  step  4. 

step  3.  Use  the  Stack  to  choose  a Cell  p in  occupied  status  for  preemption; 

return  Cell  p to  the  Packet  Memory  System  through  execution  of 
Procedure  3;  then  continue  with  step  4. 
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step  4. 


step  5. 


step  6. 


step  7. 


Procedure 


step  1. 


step  2 . 


Make  an  entry  {n,  engaged!  for  Cell  p in  the  Association  Table. 
Determine  whether  the  instruction  with  minor  address  n has  status 
absent  through  execution  of  Procedure  4. 

Update  the  operand  register  of  Cell  p having  minor  address  n ac- 
cording to  the  content  y of  the  data  or  control  packet  (the  rules 
for  updating  are  given  in  Figure  3.6).  If  Cell  p is  occupied . 
the  state  change  of  the  register  must  be  consistent  with  the  in- 
struction code,  or  the  program  is  invalid.  If  Cell  p is  engaged  . 
the  change  must  be  consistent  with  the  register  status  left  by 
preceeding  packet  arrivals. 

If  all  operand  registers  of  Cell  p are  in  the  trap  state  as  defined 
in  Figure  3.6,  set  the  Association  Table  entry  for  Cell  p to  {n, 
free}.  Otherwise  continue  with  step  7. 

If  Cell  p is  occupied: 

a.  If  all  three  registers  are  enabled  (according  to  the  rules  of 
Figure  3.6),  the  Cell  p is  enabled;  transmit  an  operation  or 
decision  packet  to  the  Arbitration  Network  and  set  the  status 
of  Cell  p to  free. 

b.  If  the  Cell  p is  not  enabled,  change  the  order  of  Cells  in  the 
Stack  to  make  Cell  p the  last  candidate  for  displacement. 

2:  Arrival  of  a retrieve  packet  {n,  x}  with  minor  address  n and 
content  x. 

If  the  minor  address  n is  0,  the  packet  is  a node  of  the  absent 
list  and  has  been  requested  and  is  processed  by  Procedure  4.  If 
the  minor  address  is  not  0,  continue  with  step  2. 

Let  p be  the  Instruction  Ce'l  with  entry  [n,  engaged!  in  the  Asso- 
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step  3. 

step  4. 

step  5. 

step  6. 


Procedure 


step  2 . 


step  3. 


ciation  Table. 

The  status  of  the  operand  registers  of  Cell  p must  be  consistent 
with  the  content  x of  the  retrieve  packet,  or  the  program  is  in- 
valid. Update  the  contents  of  Cell  p to  incorporate  the  instruc- 
tion and  operand  status  information  in  the  retrieve  packet. 

Change  the  Association  Table  entry  for  Cell  p from  fn,  engaged} 
to  (n,  occupied! . 

If  all  operand  registers  of  Cell  p are  in  the  trap  state  as  defined 
in  Figure  3.6,  set  the  Association  Table  entry  for  Cell  p to  [n, 
free! . Otherwise  continue  with  step  6. 

If  all  registers  of  Cell  p are  enabled,  then  Cell  p is  enabled: 
transmit  the  Cell  contents  to  the  Arbitration  Network  and  set  the 
status  of  the  Cell  to  free. 

3:  Return  Cell  p with  status  {n,  occupied!  to  the  Packet  Memory 
System. 

Transmit  a {0,  getid}  command  packet  to  the  Packet  Memory  System  to 
obtain  a free  unid.  The  major  address  of  the  Cell  Block  is  appended 
to  the  minor  address  0 of  the  command  packet  as  it  travels  through 
the  Memory  Command  Network. 

Process  all  retrieve  packets  received  (Procedure  2)  until  a unid 
packet  {0,  i}  arrives  at  the  Cell  Block.  Upon  its  arrival,  trans- 
mit to  the  Packet  Memory  System  the  store  packet  [i,  x} , where  x 
is  the  contents  of  Cell  p.  Transmit  the  store  command  packet 
{i,  store}  to  the  Memory  Command  Network. 

Is  the  portion  of  the  Association  Table  holding  the  absent  entries 


full?  If  not,  go  to  step  5.  Otherwise  transmit  another  {0,  getid} 


-79- 


command  packet  to  the  Packet  Memory  System  and  continue  to  process 
retrieve  packets  with  Procedure  2. 

step  4.  Upon  receipt  of  the  unid  packet  {0,  j}  , transmit  the  content  z of 
the  second  section  of  the  Association  Table  to  the  Packet  Memory 
System  as  a store  packet  {j,  z} . Send  the  store  command  packet 
{j,  store}  to  the  Packet  Memory  System.  Set  the  last  entry  in  the 
Association  Table  to  j. 

step  5.  Add  the  entry  {n,  i}  to  the  absent  list  in  the  Association  Table. 

Procedure  4:  Search  the  absent  list  for  an  entry  with  minor  address  n. 

step  1.  Let  P be  the  value  of  the  last  entry  of  the  Association  Table. 

step  2.  Does  the  node  of  the  absent  list  located  in  the  Association  Table 

contain  an  entry  with  minor  address  n?  If  not,  go  to  step  3.  Other- 
wise let  the  entry  with  minor  address  n be  {n,  m} ; go  to  step  6. 

step  3.  Let  a be  the  value  of  the  last  entry  of  the  Association  Table.  If 
a is  nil , go  to  step  5.  Otherwise  transmit  the  retrieve  command 
packet  {a,  retr,  0}  to  the  Packet  Memory  System  through  the  Memory 
Command  Network  and  continue  to  process  retrieve  packets  using 
Procedure  2 . 

step  4.  Upon  receipt  of  the  requested  retrieve  packet,  set  the  last  entry 
of  the  Association  Table  to  the  value  designated  by  the  selector 
next  in  the  retrieve  packet.  Let  z be  the  content  of  the  absent 
list  located  in  the  Association  Table;  transsdt  the  store  packet 
{a,  z}  and  the  store  command  packet  {a,  store}  to  the  Packet  Memory 
System  through  the  Arbitration  Network  and  the  Memory  Command  Net- 
work. Transmit  the  content  of  the  retrieve  packet  to  the  absent 
list  in  the  Association  Table;  go  to  step  2. 
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Set  the  last  entry  of  the  Association  Table  to  p;  transmit  the 
command  packet  [n,  retr]  to  the  Packet  Memory  System  via  the  Mem- 
ory Command  Network;  go  to  step  8. 

Transmit  the  command  packet  {m,  retr , n]  to  the  Packet  Memory  Sys- 
tem through  the  Memory  Command  Network;  transmit  the  command  packet 
{m,  dwn]  to  the  Packet  Memory  System. 

Delete  the  entry  {n,  m]  from  the  absent  items  in  the  Association 
Table.  If  no  absent  items  are  left  in  the  Association  Table  and 
p £ nil . send  the  command  packets  {p,  retr . 0}  and  {p,  dwn]  to  the 
Memory  Command  Network,  and  upon  receipt  of  the  retrieve  packet, 
transmit  it  to  the  absent  list  in  the  Association  Table.  Other- 
wise set  the  last  entry  of  the  Association  Table  to  p. 

Return  to  Procedure  1. 


3.4  Procedure  Imp lementation 

All  programs  executed  within  the  data-flow  processor  are  data-flow  pro- 
cedures. Although  this  restriction  to  programs  represented  as  acyclic  di- 
rected graphs  rules  out  the  performance  of  an  iterative  computation,  we  shall 
see  in  Chapter  4 that  this  is  not  a serious  restriction,  and  that  iterative 
computation  can  be  efficiently  expressed  and  performed  through  recursive 
procedure  activation. 

The  procedures  of  the  data-flow  language  are  represented  as  acyclic 
directed  graphs  in  the  manner  described  in  Section  2.3.  A procedure  can 
thus  be  manipulated  as  a structure  within  the  structure  processing  section 
of  the  processor  through  use  of  the  structure  operations  construct . append . 
itlgt  e , and  select  defined  in  Section  2.2.  A compiler  can  construct  a data- 


step  5. 


step  6. 


step  7. 


step  8. 
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flow  program  as  a structure  within  tha  Structure  Memory,  and  modified  ver- 
sions of  a program  can  be  readily  created. 

Each  instruction  of  a data-flow  procedure  is  represented  as  one  node 
in  the  program  structure.  An  operand  in  an  instruction,  consisting  of  a 
gating  code  and  a receiver,  occupies  one  register  in  an  Instruction  Cell, 
and  its  components  are  located  in  the  same  register  of  a Structure  Cell 
containing  the  instruction.  A destination  identifier  occupies  a complete 
register  in  a Structure  Cell,  whereas  it  occupies  only  a portion  of  a reg- 
ister in  an  Instruction  Cell.  This  is  necessary  to  allow  a procedure  to 
be  processed  as  a structure  in  the  Structure  Memory,  yet  occupy  the  minimum 
amount  of  space  in  the  Instruction  Memory.  The  changes  in  format  of  an  in- 
struction occur  as  it  is  transferred  into  and  out  of  an  Instruction  Cell, 
hence  the  format  of  the  instruction  is  that  of  the  structured  representation 
at  all  times  when  it  is  not  resident  in  an  Instruction  Cell. 

The  formation  of  programs  within  the  structure  processing  section  of 
the  data-flow  processor  and  their  subsequent  execution  in  the  instruction 
processing  section  requires  that  the  Packet  Memory  Systems  of  the  two  sec- 
tions be  combined  for  efficient  operation.  The  organization  of  the  complete 
data-flow  processor  with  combined  Packet  Memories  is  shown  in  Figure  3.17. 

Although  we  could  eliminate  the  Instruction  Memory  and  extend  the  ca- 
pability and  complexity  of  the  Structure  Memory  to  allow  it  to  execute  a 
data-flow  procedure  by  means  of  the  structure  operations  described  for  in- 
struction execution  in  Section  2.3,  there  are  a number  of  reasons  for  con- 
tinuing to  process  instructions  in  a sep -rate  Instruction  Memory.  First, 
each  instruction  is  only  used  once  and  then  discarded  from  the  Instruction 
Memory,  whereas  structure  nodes  are  retained  within  the  Structure  Memory 

- - J 
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Figure  3.17.  Organization  of  the  data-flow  processor. 
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after  being  used.  Hence  the  algorithm  to  handle  the  Association  Table 
and  Stack  in  an  Instruction  Cell  Block  differs  greatly  from  the  one  used 
in  a Structure  Cell  Block.  Second,  the  operations  performed  by  an  Instruc- 
tion Cell  are  very  dissimilar  to  those  performed  by  a Structure  Cell,  and 
the  complexity  of  a Cell  designed  to  handle  both  would  be  excessive.  Third, 
the  use  of  specialized  Instruction  Cells  allows  the  function  of  gates  to  be 
incorporated  into  the  instructions  associated  with  operators  and  deciders, 
rather  than  using  separate  instructions  for  the  representation  of  gate  ac- 
tors. Last,  the  use  of  separate  memories  reduces  somewhat  the  size  of  each 
Arbitration  and  Distribution  Network,  increasing  the  utilization  of  each 
memory  [24]. 

Each  node  of  a program  structure  representing  a data-flow  procedure  is 
identified  by  its  unique  identifier  within  the  Packet  Memory  System  of  the 
processor.  The  contents  of  the  nodes  designated  by  destination  selectors  in 
an  instruction  are  the  unique  identifiers  of  the  destination  instructions. 

A procedure  in  the  data-flow  processor  is  identified  by  the  unid  of  its 
first  instruction. 

There  are  a number  of  possible  approaches  to  the  execution  of  a proce- 
dure in  the  data-flow  processor.  The  "copy  rule"  for  procedure  activation 
could  be  strictly  followed,  creating  a unique  copy  of  the  procedure  for  each 
activation.  Or,  a unique  activation  record  could  be  formed,  containing  the 
data  associated  with  an  activation.  Both  of  these  approaches  have  the  same 
major  problems.  The  copying  of  an  arbitrary  program  structure  or  activation 
record  is  not  easily  accomplished  within  the  processor  due  to  the  presence 
of  shared  nodes.  Also,  using  a separate  structure  for  each  activation  is 
wasteful;  such  a structure  would  not  be  fully  utilized  since  the  number  of 


A 
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unexecuted  instructions  in  a program  increases  at  a rapid  rate  with  the 
number  of  conditionals  in  the  program,  and  the  version  of  a program  stored 
in  the  Packet  Memory  System  is  only  modified  when  instructions  of  the  pro- 
gram are  displaced  from  the  Instruction  Memory.  In  a properly  structured 
processor,  the  number  of  instructions  returned  to  the  Packet  Memory  System 
should  be  small. 

The  approach  presented  in  this  thesis,  while  not  as  elegant  as  a strict 
implementation  of  the  copy  rule,  is  more  efficient  in  that  a new  copy  of  an 
instruction  is  created  only  when  necessary  to  avoid  conflict.  One  copy  of 
a procedure  is  maintained  in  the  Packet  Memory  System,  and  each  activation 
of  the  procedure  retrieves  its  instructions  from  that  copy.  A procedure 
activation  is  uniquely  identified  by  the  unid  of  its  argument  structure.  The 
node  identifier  of  an  instruction  sent  to  the  Instruction  Memory  is  formed 
by  a concatenation  of  the  argument  structure  identifier  and  the  instruction 
identifier.  The  argument  structure  identifier  must  be  associated  with  all 
instruction  identifiers  during  execution  of  the  procedure. 

The  only  possibility  of  conflict  between  separate  activations  of  a pro- 
cedure arises  when  a partially  enabled  instruction  is  returned  to  the  Packet 
Memory  System  from  an  Instruction  Cell  Block.  Conflict  is  avoided  by  assign- 
ing a new  unid  to  an  instruction  which  is  returned  to  the  Packet  Memory  System 
and  changing  its  status  in  the  Cell  Block  to  absent,  as  described  in  Section 
3.3.4.  We  are  assured  that  in  a correct  program  an  instruction  with  status 
absent  will  eventually  receive  another  operand  and  be  recalled  to  the  Cell 
Block  from  which  it  was  displaced. 

The  Structure  Operation  Unit  of  the  processor  is  responsible  for  the 
processing  of  apply  and  return  instructions.  An  apply  instruction  can  be 
transmitted  to  the  Structure  Operation  Unit  from  either  the  Instruction  Mem- 
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ory  or  from  outside  the  processor.  The  instruction  is  of  the  following  format 


{apply  dest 

l J 


and  specifies  that  the  procedure  with  root  node  P is  to  be  applied  to  the 
structure  with  root  node  a.  The  Structure  Operation  Unit  first  creates  an 
argument  structure  with  components  a and  the  destination  dest.  This  argu- 
ment structure  is  then  presented  to  a new  activation  of  P. 

Upon  reaching  a return  instruction  in  the  execution  of  a procedure, 
the  Structure  Operation  Unit  is  sent  an  operation  packet  of  the  form 


return  - 
* dest  • 
Y 


where  dest  is  the  destination  specified  in  the  apply  instruction  which 
activated  the  procedure,  and  y is  the  unid  of  the  result  structure.  The 
Structure  Operation  Unit  merely  forms  a data  packet  of  the  destination  and 
the  result  and  presents  the,  packet  to  the  Distribution  Network  associated 
with  the  Instruction  Memory: 


M 


The  process  of  procedure  activation  can  be  examined  more  closely  by 
considering  the  activation  of  the  procedure  P shown  in  Figure  2.11.  Upon 
receipt  by  the  Structure  Operation  Unit  of  an  apply  instruction 


{apply  dest 

s J 


the  following  sequence  occurs: 

1.  An  argument  structure  6 is  created  with  destination  dest  and 
argument  a. 
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2.  The  data  packet 

{:} 

is  sent  to  the  Instruction  Memory.  P is  an  identifier  formed  by 
the  concatenation  of  the  unique  identifiers  6 and  P.  A command 
packet  {P,  retr,  p}  is  sent  from  the  Instruction  Memory  to  the 
Packet  Memory  System  to  retrieve  the  instruction  P and  send  it 
to  destination  p. 

Since  the  initial  instruction  of  P is  merely  a distribution  instruction 
which  sends  the  input  structure  to  the  select  instructions,  the  Instruction 
Cell  assigned  to  P is  enabled  as  soon  as  the  retrieve  packet  conveying  P 
arrives  at  the  Cell.  Upon  its  becoming  enabled: 

1.  An  operation  packet  of  the  following  format  is  sent  to  the  Arbitra- 
tion Network: 

{dist  destl*| 

dest2„  ' j 

2.  The  status  of  the  Instruction  Cell  holding  P is  set  to  free. 

The  execution  of  the  procedure  then  proceeds  in  accordance  with  the  rules 
presented  previously. 


> 


f 
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Chapter  4 

RECURSIVE  vs.  ITERATIVE  REPRESENTATION 

To  illustrate  the  reason  for  restricting  programs  within  the  data-flow 
processor  to  those  represented  as  acyclic  directed  graphs,  this  chapter  ex- 
amines the  performance  tradeoff  between  iterative  and  re.cursive  computation 

i 

within  the  data-flow  architecture.  The  restriction  is  necessitated  by  the 
threat  of  deadlock  which  immediately  arises  in  a cyclic  computation  such  as 
iteration.  In  this  chapter  we  consider  the  impact  on  processor  performance 
of  the  necessary  modifications  to  an  iterative  program  to  assure  freedom 
from  deadlock.  The  performance  of  iterative  and  recursive  program  struc- 
tures within  the  data-flow  processor  is  then  examined  to  demonstrate  the 
greater  efficiency  of  the  acyclic  recursive  version. 


4.1  The  Nature  of  the  Deadlock  Problem 


A close  examination  of  the  architectural  basis  of  the  data-flow  pro- 
cessor immediately  brings  one  face-to-face  with  one  of  the  classic  problems 
of  parallel  computation,  that  of  deadlock.  This  deadlock  problem  manifests 
itself  in  a manner  which  affects  the  fundamental  operation  of  the  processor. 
Therefore,  modifications  to  the  architecture  are  necessary  in  order  to  pre- 
vent a deadlock  condition  from  arising  in  a cyclic  or  stream-oriented  com- 
putation. 

The  nature  of  the  deadlock  problem  and  its  solution  for  stream-orienteJ 
computation  is  thoroughly  discussed  in  [23].  In  this  section  we  consider 
the  implications  of  this  problem  for  iterative  program  structures  and  pre- 
sent a solution  to  the  problem.  In  Section  4.3  we  examine  the  impact  of  this 
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solution  upon  the  performance  of  an  iterative  data-flow  program. 

The  firing  rule  for  an  operator  or  link  of  a data-flow  program  states 
that  the  operator  or  link  cannot  become  enabled  unless  there  is  no  token 
on  any  output  arc  of  that  operator  or  link.  However,  the  architecture  of 
the  processor  described  in  Chapter  3 provides  no  mechanism  by  which  an  in- 
struction can  check  that  the  registers  specified  in  its  destination  address 
fields  are  empty.  Consequently,  the  firing  rule  can  be  violated. 

The  highly  parallel  and  cyclic  nature  of  an  iterative  computation  can 

allow  several  cycles  of  the  iteration  to  be  simultaneously  active.  Due  to 

the  fact  that  the  firing  rule  is  not  observed,  tokens  from  different  cycles 

can  then  interact  within  the  program.  In  illustration  of  how  a deadlock 

condition  arises  from  this  interaction,  consider  the  iterative  data-flow 

program  of  Figure  2.3  which  represents  the  following  computation: 

input  x,  y 
n :=  0 
while  joy  do 
x :=  x-y 
n : = n+1 

end 

output  x,  n 

Upon  completion  of  the  program,  n is  equal  to  the  original  value  of  x divided 
by  y,  and  x is  equal  to  the  remainder.  A detailed  description  of  the  opera- 
tion of  this  program  is  presented  in  Section  2.1. 

In  the  program  of  Figure  2.3  it  is  possible  for  the  decider  to  fire  a 
significant  time  before  the  subtraction  operator.  Once  the  decider  has  fired, 
if  the  result  of  the  decision  is  true,  a token  conveying  the  value  y is  re- 
turned to  the  input  of  the  program  through  the  left-most  merge  actor.  If  the 
subtraction  operator  has  not  fired  by  this  time,  then  two  tokens  carrying  the 
value  y can  be  simultaneously  present  on  an  input  arc  of  the  operator.  Within 
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the  processor,  a data  packet  conveying  the  new  value  of  y destined  for  the 
subtraction  operator  is  stored  in  a buffer  unit  of  the  Distribution  Network 
since  the  destination  register  is  occupied.  The  stored  packet  blocks  access 
to  succeeding  switch  units,  and  a deadlock  condition  arises  if  the  data  pack- 
et conveying  the  value  x necessary  for  the  enabling  of  the  subtraction  op- 
erator is  blocked  by  this  stored  packet. 

Note  that  such  blocking  can  also  occur  within  the  Distribution  Network 
of  the  structure  processing  section  of  the  machine.  If  a number  of  opera- 
tion packets  are  destined  for  the  same  Structure  Cell,  they  may  temporarily 
block  other  packets  in  the  Distribution  Network.  However,  no  deadlock  con- 
dition can  arise  since  a blocking  packet  will  eventually  move  on  and  cannot 
block  packets  which  are  needed  by  the  Cell  to  which  it  is  destined. 

The  solution  to  the  deadlock  problem  in  the  Instruction  Memory  requires 
the  addition  of  a form  of  feedback  between  operators  of  a program  in  order 
to  force  the  program  to  observe  the  firing  rule.  In  the  case  of  the  Itera- 
tive data-flow  program  of  Figure  2.3,  the  feedback  assures  that  all  opera- 
tions of  one  cycle  of  the  iteration  are  concluded  before  the  next  cycle  is 
initiated.  The  feedback  is  accomplished  by  placing  a decider  with  the  nil 
predicate  on  the  output  link  of  each  gate  actor  of  the  program.  Each  of 
these  deciders,  upon  receiving  a data  value  indicating  that  the  gate  has 
fired,  produces  a control-valued  control  token.  All  control-valued  tokens 
so  produced  are  ANDed , and  the  resulting  token  is  used  to  reenable  the  pro- 
gram input.  The  deadlock-free  version  of  the  iterative  data-flow  program 
of  Figure  2.3  is  shown  in  Figure  4.1. 


The  merge  actors  can  no  longer  be  ignored  in  the  implementation  of  the 
data-flow  program  since  they  are  now  utilized  to  control  the  initiation  of 


Figure  4.1.  Deadlock-free  version  of  the  iterative  data-flow 
program  of  Figure  2.3. 
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each  cycle  of  the  iteration.  Each  merge  actor  has  a gate  on  its  output. 

This  gate  allows  the  merge  actor  to  become  enabled  only  after  receiving  a 
data  value  and  a control- valued  control  token.  Thus,  a token  indicating 
that  the  previous  cycle  of  the  iteration  has  been  completed  must  be  re- 
ceived before  initiation  of  a new  cycle. 

The  cost  of  implementing  this  deadlock  solution  is  high,  both  in  terms 
of  space  and  time.  The  instruction  register  of  an  Instruction  Cell  must  be 
augmented  to  allow  the  receipt  (when  necessary)  of  a control -valued  token 
for  the  enabling  of  that  register  [23].  A number  of  Boolean  operators  must 
be  added  to  the  program  to  perform  the  ANDing  of  the  control -valued  packets, 
and  a merge  instruction  must  be  implemented  in  the  architecture. 


4.2  Performance  of  the  Architecture 

In  this  section  we  describe  an  elementary  analysis  of  the  performance 
of  a data-flow  processor.  The  performance  of  the  processor  is  examined 
through  consideration  of  the  flow  of  packets  within  the  networks  of  the  pro- 
cessor. This  section  is  intended  to  give  us  a framework  within  which  to 
intelligently  compare  the  performance  of  cyclic  and  acyclic  representations 
of  a computation  and  does  not  touch  upon  peripheral  issues  such  as  the  proper 
structuring  of  the  networks  of  the  processor.  A more  detailed  analysis  of 
the  performance  of  a data-flow  architecture  and  rules  for  the  structuring 
of  the  various  networks  are  presented  in  [24]. 

The  performance  of  a data-flow  processor  can  be  measured  through  con- 
sideration of  the  minimum  elapsed  time  between  the  enabling  of  an  instruc- 
tion and  the  arrival  of  its  results  at  the  desired  destinations.  If  the  num- 


ber of  Functional  Units  is  chosen  and  the  networks  of  the  processor  are  struc- 
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tured  so  that  all  instructions  pass  through  the  Arbitration  Network  in  this 
time,  then  we  are  assured  of  maintaining  a constant  supply  of  instructions 
for  the  Functional  Units.  We  have  also  discovered  the  processing  capability 
of  the  machine. 

The  cycle  time  of  an  instruction  within  the  processor  is  the  minimum 
elapsed  time  between  the  enabling  of  the  instruction  and  the  arrival  of 
the  results  of  the  operation  specified  by  the  instruction  at  their  destin- 
ation Cells.  The  cycle  time  for  a given  instruction  is  affected  by  con- 
flict in  the  networks.  For  a simple  instruction  representing  an  operator 
of  a data-flow  program,  the  cycle  time  is  equal  to  the  passage  time  through 
the  Arbitration  Network,  the  Distribution  Network,  and  an  appropriate  Op- 
eration Unit.  The  delay  in  the  Operation  Unit  is  fixed  for  that  Operation 
Unit.  However,  the  network  delays  can  vary  greatly. 

The  cycle  time  for  an  instruction  is  found  by  considering  the  passage 
of  the  operation  packet  containing  that  instruction  through  the  Arbitration 
Network  and  the  passage  of  the  resulting  data  packets  through  the  Distribu- 
tion Network,  assuming  no  conflict  arises  in  either  network.  The  minimum 
delay  through  a network,  the  Arbitration  Network  for  example,  is  given  by 
the  summation  over  the  number  of  stages  in  the  network  of  the  time  required 
to  transfer  a packet  through  each  stage: 

minimum  delay  = E (no.  bits  serial  + l)(bit  transfer  time) 
stages 

The  transfer  time  for  a stage  is  equal  to  the  number  of  bits  passing  through 
the  stage  in  serial  plus  one  for  a signal  to  indicate  that  the  packet  is 
ready  to  be  transferred  multiplied  by  the  time  necessary  to  transfer  a bit. 

A similar  equation  applies  to  delay  in  the  Distribution  Network. 
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Let  us  examine  the  delay  within  a specific  Arbitration  Network  (Figure 
4.2).  This  network  has  three  stages  and  seven  arbitration  units.  Packets 
travel  through  stage  0 in  four-bit  serial  format  and  are  gradually  converted 
to  a more  parallel  format,  passing  through  stage  1 in  two-bit  serial  and 
stage  2 in  one-bit  serial  format.  As  noted  previously,  the  passage  time 
for  a packet  through  each  stage  is  equal  to  the  number  of  serial  bits  plus 
one  times  the  bit  transfer  time  t.  For  this  structure,  the  transfer  times 
are  5t , 3t,  and  2t,  respectively.  The  minimum  delay  through  the  network  is 
equal  to  the  summation  of  the  stage  delays,  or  10  t. 

In  order  to  find  the  time  necessary  to  process  all  instructions  con- 
tained in  the  Instruction  Memory  of  the  processor,  T,  we  must  consider  the 
maximum  delay  a packet  can  encounter  in  passing  through  the  Arbitration 
Network.  Such  a maximum  delay  can  only  occur  in  a network  which  has  a 
packet  present  at  every  node  in  a machine  in  which  every  Instruction  Cell 
Block  contains  an  enabled  instruction,  placing  a packet  on  each  input  to 
the  Arbitration  Network  (Figure  4.3).  The  maximum  delay  which  can  be  en- 
countered by  a packet,  say  the  triangular  one,  arises  only  when  all  other 
packets  in  the  network  and  at  the  inputs  of  the  network  pass  through  the 
output  of  the  network  before  the  triangular  one  does.  In  order  for  this  to 
happen,  not  only  must  the  triangular  packet  lose  every  conflict,  but  every 
packet  on  the  path  it  will  follow  to  the  output  must  also  lose  every  con- 
flict. Thus,  finding  the  maximum  delay  involves  examining  how  many  packets 

I 

flow  through  each  stage  before  the  triangular  one. 

For  this  network  the  worst  case  packet  will  be  the  14th  through  stage 
2,  the  6th  through  stage  1,  and  the  2nd  through  stage  0.  Multiplying  the 
number  of  packets  passing  through  each  stage  by  the  delay  in  that  stage, 


Stage 

Number 

Serial 

Bits 


Passage 

Time 


Figure  4.2.  Structure  of  an  elementary  Arbitration  Network 


-96- 


we  find  that: 

T ■ maximum  delay  = 2(5t)  + 6(3t)  + 14(2t) 

T « 56t 

Thus,  if  all  instructions  of  the  processor  are  enabled,  they  can  pass  through 
this  Arbitration  Network  in  a maximum  time  of  56t. 

However,  if  we  assume  that  the  network  size  is  such  that  the  cycle  time 
ie  slightly  less  than  T,  then  a number  of  destination  Cells  receive  their 
results  and  become  enabled  as  the  final  Cells  pass  through  the  Arbitration 
Network,  and  the  processing  rate  of  the  machine  can  be  measured  in  terms 
of  the  output  rate  of  the  Arbitration  Network  (assuming  the  Distribution 
and  Control  Networks  have  been  structured  to  distribute  all  results  as  fast 
as  they  are  produced).  In  such  a case,  the  rate  of  packet  transfer  to  each 
Operation  Unit  is  l/(2t),  and  the  maximum  processing  rate  of  the  machine 
is  [ 1/ (2 t) ] (number  of  Operation  Units). 

Furthermore,  if  each  arbitration  unit  has  enough  inputs  to  allow  a packet 
to  travel  through  the  previous  stage  in  a time  less  than  that  required  to  ser- 
vice all  busy  inputs,  the  passage  of  the  triangular  packet  through  the  first 
stages  of  the  Arbitration  Network  will  occur  simultaneously  with  the  trans- 
mission of  other  packets  at  the  output.  The  time  T for  the  transmission  of 
all  packets  in  the  network  to  the  Operation  Units  is  then  14(2t)  - 28t. 

4.3  Example : An  Iterative  Computation 

The  program  modifications  described  in  Section  4.1,  necessary  to  assure 
freedom  from  deadlock  in  an  iterative  computation,  cause  a significant  deg- 
radation in  the  performance  of  such  a computation.  Applying  the  analysis 
techniques  developed  in  the  previous  section  to  a specific  processor  allows 


-97- 


us  to  examine  the  actual  nature  of  the  performance  degradation. 

In  illustration  of  the  performance  of  a data-flow  processor  executing 
a cyclic  computation,  consider  a 128  Instruction  Cell  Block  machine  in 
which  an  operation  packet  exits  a Cell  Block  in  16-bit  parallel,  4-bit  ser- 
ial format.  For  a balanced  processor  structure,  one  in  which  the  number 
of  Functional  Units  is  matched  to  the  processing  rate,  the  processing  time 
T must  be  equal  to  the  activation  time,  that  is,  the  minimum  time  between 
successive  activations  of  an  instruction  within  the  processor.  Thus,  in 
order  to  determine  the  performance  of  the  processor,  we  must  consider  the 
structure  of  the  networks  and  determine  the  value  of  the  activation  time. 

The  modifications  to  an  iterative  program  necessary  to  assure  freedom 
from  deadlock  can  cause  a significant  delay  to  exist  between  successive 
activations  of  a given  instruction.  This  delay  is  due  to  the  fact  that 
each  Instruction  Cell  which  contains  a gated  operand  must,  as  it  sends 
out  an  operation  packet,  return  control  packets  to  preceding  actors  in 
the  program  which  supply  values  to  the  gated  register(s).  Upon  receipt  of 
these  control  values,  the  preceding  operators  are  reenabled  if  their  oper- 
ands are  present.  If  D is  the  cycle  time  for  an  operation  packet,  and  d is 
the  cycle  time  for  a control  packet;  that  is,  the  delay  through  the  Arbi- 
tration Network  and  the  Control  Network,  then  the  minimum  activation  time 
of  an  Instruction  Cell  is  D+d. 

In  order  to  obtain  a small  activation  time,  and  hence  a greater  pro- 
cessing capability,  the  networks  must  be  structured  with  as  few  stages  as 
possible.  However,  a minimum  of  three  stages  is  required  within  the  net- 
works of  this  processor  to  perform  the  serial- to-parallel  conversions  and 
still  maintain  the  necessary  throughput  from  stage  to  stage.  The  minimum 
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delay  analysis  of  the  three  stage  network  structure  of  the  128  Cell  Block 
processor  is  Identical  to  that  described  in  the  previous  section;  the  min- 
imum delay  in  the  Arbitration  Network  is  equal  to  10t. 

Assuming  the  delay  in  the  Distribution  Network  is  the  same  as  that  in 
the  Arbitration  Network,  and  the  minimum  delay  in  the  Control  Network  is  ap- 
proximately one-half  the  delay  in  the  Arbitration  Network  or  Distribution 
Network,  the  resulting  values  for  the  cycle  times  and  activation  time  AT 
are: 

D « 20t 
d « 15t 

AT  - D+d  = 35t 


If  t = 150  nanoseconds,  allowing  15  TTL  gate  delays  to  accomplish  one 

ready/acknowledge  cycle,  the  resulting  activation  time  is: 

AT  - 35(150  nsec.) 

= 5.25  psecond s 

And  the  maximum  processing  capability  of  the  architecture  containing  cyclic 
program  structures  is: 


processing  rate 


128  instructions 
5.25  ^seconds 

24  MIPS  (million  instructions  per  second) 


4.4  Recursive  Representation 

The  set  of  problems  which  can  be  expressed  iteratively  is  a subset  of 
those  suited  to  recursive  representation.  Therefore,  any  iterative  compu- 
tation can  be  expressed  in  the  data-flow  language  as  a procedure  and  can 
be  performed  through  recursive  procedure  activation. 

Naturally,  if  the  objective  of  a program  is  to  perform  the  same  opera- 


tion  on  a large  number  of  items,  then  in  a parallel  computer  system,  a fora  11 
construct  can  be  much  more  efficiently  utilized  than  an  iterative  program. 
However,  if  each  cycle  of  an  iteration  depends  upon  values  generated  in  pre- 
vious cycles,  a recursive  representation  can  be  utilized  to  avoid  the  cyclic 
structure  and  often  seems  to  provide  a more  readily  understandable  expression 
of  the  computation. 

The  acyclic  nature  of  a procedure  in  the  data-flow  language  assures  that 
even  though  the  firing  rule  may  not  be  observed,  two  tokens  cannot  simultan- 
eously exist  on  one  arc  of  a properly  structured  program  graph.  Hence,  no 
deadlock  condition  can  arise  within  an  activation  of  a procedure  or  between 
concurrent  or  recursive  activations  of  the  procedure,  and  the  procedure  can 
be  more  efficiently  executed  in  the  data-flow  architecture  than  a cyclic 
representation  of  the  same  computation. 

A recursive  description  of  the  iterative  computation  of  Figure  2.3 
is  stated  as  follows: 

P:  proc  (label  P)  (x,  y,  n) 
if  y>y  then  P(x-y,  y,  n+1) 
else  (x,  n) 

in  P(x,  y,  0) 

A data-flow  representation  for  P is  given  in  Figure  4.4.  The  argument  struc- 
ture for  the  procedure  contains  three  elements.  In  addition  to  the  argument 
and  the  destination,  it  holds  the  unid  of  P (label  P)  to  allow  the  procedure 
to  be  reapplied  if  necessary.  The  number  of  gates  is  reduced  in  the  recursive 
version,  reducing  the  number  of  control  distribution  Cells  which  are  needed 
in  the  program.  However,  a number  of  structure  operations  have  been  added, 
probably  balancing  the  savings  involved  in  eliminating  the  gates.  The  apply 
actor  in  the  program  passes  as  a destination  the  destination  it  received,  and, 

* _ 


elect  arg-x  select  arg*y  select  arg-n  select  dest  select  proc 


Figure  4.4.  Recursive  version  of  the  data-flow 
program  of  Figure  2.3. 
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after  the  firing  of  the  apply  actor,  the  activation  is  terminated.  When  the 
result  of  the  decision  is  finally  false  and  a return  instruction  is  executed, 
the  destination  to  which  the  values  x and  n are  sent  is  the  destination  spec- 
ified by  the  original  calling  procedure. 

Since  no  feedback  is  necessary  in  the  recursive  representation,  the  ac- 
tivation time  for  the  128  Cell  Block  processor  executing  a data-flow  proce- 
dure is  equal  to  the  cycle  time  D,  and  the  resulting  performance  of  the  pro- 
cessor is: 


maximum  processing  rate 


128  instructions 
3 psec. 

42  MIPS 


Not  only  does  a recursive  program  representation  occupy  less  space  in  the 
data-flow  processor  than  a cyclic  program  structure,  but  a processor  exe- 
cuting programs  represented  as  acyclic  directed  graphs  can  realize  almost 
twice  the  performance  achievable  through  the  execution  of  cyclic  program 
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Chapter  5 

TOPICS  FOR  FURTHER  RESEARCH 

In  this  thesis  the  data-flow  architecture  has  been  presented  as  a so- 
lution to  many  of  the  problems  of  highly  parallel  computer  systems.  The  use 
of  interconnection  networks  between  sections  of  the  processor  provides  an 
attractive  approach  to  the  communication  of  information  between  units  com- 
prising a large  system  and  allows  an  interesting  method  of  analyzing  system 
performance  through  the  examination  of  information  flow  within  the  networks. 

Due  to  the  radical  nature  of  the  architecture  presented,  undoubtably 
more  questions  have  been  raised  by  this  thesis  than  have  been  answered. 

These  questions  range  from  mild  quibblings  over  the  use  of  certain  methods 
of  representation  or  design  choices  to  deep  semantic  and  philosophical  is- 
sues. In  this  chapter  we  will  point  out  some  of  the  unanswered  questions 
and  indicate  some  of  the  issues  involved. 

The  data-flow  procedure  language,  while  appearing  to  be  a semantically 
elegant  method  of  expressing  parallelism,  is  wide  open  to  further  study  and 
extension.  The  language  ne?ds  to  be  expanded  by  the  addition  of  such  actors 
as  a fora 11  construct  to  enable  it  to  better  express  concurrent  processing 
of  the  elements  of  a structure.  However,  the  best  choice  of  semantics  for 
such  constructs  is  yet  to  be  clearly  established.  Also,  the  language  does 
not  currently  contain  the  capability  to  express  nondeterminate  computation, 
a very  important  feature  for  some  applications. 

Further  investigation  of  the  use  of  the  data-flow  language  is  also  nec- 
essary. Upon  initial  examination,  the  representation  of  such  algorithms  as 
the  fast  Fourier  transform  in  data-flow  form  appears  very  attractive  [25]. 
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However,  data-flow  representations  for  other  computations  need  to  be  devel- 
oped, examined,  and  contrasted  with  more  conventional  programs. 

The  data-flow  language  is  designed  to  serve  as  the  base  language  of  the 
data-flow  processor.  It  is  not  regarded  as  an  acceptable  user  language.  The 
development  of  a user  language  which  can  be  readily  translated  into  a data- 
flow representation  is  currently  under  study  [35].  The  translation  of  pro- 
grams expressed  in  a language  such  as  Algol  60  to  a data-flow  representation 
is  also  currently  being  investigated  [4].  Much  more  work  needs  to  be  done 
to  identify  concurrency  in  problems  and  to  take  advantage  of  that  concurrency 
through  use  of  the  data-flow  representation. 

An  analysis  of  data-flow  programs  is  necessary  for  a complete  under- 
standing of  the  operation  of  the  data-flow  processor.  In  order  to  determine 
the  number  of  instructions  which  must  be  maintained  in  the  Packet  Memory 
System  to  provide  the  necessary  number  of  active  instructions  to  allow  the 
Instruction  Memory  to  operate  at  its  highest  rate,  one  must  understand  the 
parallelism  achievable  in  a given  data-flow  program  and  the  number  of  active 
instructions  a program  can  be  expected  to  supply  at  any  point  in  time. 

An  important  issue  which  was  discussed  briefly  in  this  thesis  is  the 
tradeoff  between  cyclic  and  acyclic  representations  of  a computation.  Due 
to  the  highly  parallel  nature  of  the  data-flow  architecture,  a recursive 
version  of  a computation  is  performed  more  efficiently  than  an  Iterative 
version.  It  can  also  be  argued  that  the  recursive  version  is  semantically 
much  cleaner  and  in  many  cases,  much  easier  to  understand.  Further  research 
is  necessary  to  fully  understand  the  use  of  concurrent  and  recursive  procedure 
activation  for  the  execution  of  cyclic  and  stream-oriented  computation. 
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The  data-flow  representation  and  the  architecture  of  the  data-flow  pro- 
cessor which  have  been  presented  in  this  thesis  are  very  attractive  as  a 
means  of  describing  parallel  computation  and  structuring  a parallel  compu- 
ter system.  The  projected  performance  of  the  data-flow  processor  is  also 
very  attractive,  and  hence,  we  are  hopeful  that  these  concepts  will  prove 
very  useful  in  the  construction  of  future  computer  systems. 
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